1. Introduction
Short text generation is becoming increasingly critical in electronic commerce, where real-time communication and engagement play a central role in marketing, branding, and community building. Platforms such as Twitter (now X), Instagram, and Reddit have transformed interactions between consumers and organizations, relying heavily on short-form content like tweets, captions, and announcements to convey information quickly and effectively. Unlike long-form narratives, short texts must be concise, emotionally resonant, and optimized for platform constraints, including strict character limits and evolving linguistic norms. Micro-content must also adapt to audience preferences, trending topics, and time-sensitive events—making automation in this domain both technically challenging and commercially valuable.
From a societal perspective, short text generation enables high-impact applications in domains such as emergency communication, sentiment monitoring, and political discourse analysis. A particularly dynamic and commercially significant use case is the Non-Fungible Token (NFT) ecosystem, which has rapidly evolved since 2020 into a core pillar of the digital economy. NFTs intersect with diverse domains including digital art, gaming, collectibles, and metaverse assets, with platforms like Twitter serving as the primary channel for community engagement and promotional activity.
Recent advances in generative artificial intelligence—particularly large language models (LLMs) such as GPT [1], BLOOM [2], LLaMA [3], Mistral [4], DeepSeek [5], and Claude [6]—have significantly improved machine-generated text quality. Domain-specific models like BERTweet and TwHIN-BERT [7,8] further optimize generation for the informal, emoji-rich language of social platforms.
However, existing LLM-based generation techniques do not employ reinforcement-based optimization grounded in real-world engagement metrics, and most fail to adapt to NFT-specific terminology, style, and community expectations. Engagement indicators such as likes, replies, and retweets are rarely included in generation objectives, and reinforcement learning remains underutilized for optimizing social media content toward concrete performance goals.
To address these limitations, we propose RL-TweetGen, an AI-powered socio-technical framework that integrates large language models with reinforcement learning to automate short-form content generation optimized for engagement. RL-TweetGen combines:
Domain-specific dataset construction and intent-aware prompt engineering;
Modular fine-tuning using Low-Rank Adaptation (LoRA);
Multi-dimensional output control via a Length–Style–Context (LSC) Variation Algorithm;
Reinforcement learning optimization guided by engagement-prediction models and expert feedback.
Research Contributions
RL-TweetGen presents a novel end-to-end system for generating high-quality, semantically aligned, and engagement-optimized tweets, specifically tailored for NFT marketing, branding, and community engagement. The key contributions of the system are outlined below:
Domain-Specific Dataset Construction: Curated NFT-related tweets with structured prompts, labeled metadata, and engagement metrics for model training and evaluation.
Intent-Aware Prompt Engineering: Semantic classification combined with domain expertise to generate structured, goal-aligned prompts.
Contextual and Semantic Fine-Tuning: Fine-tuning three instruction-tuned LLMs—LLaMA 3.1–8B, Mistral 7B, and DeepSeek 7B—using LoRA to align outputs with NFT-specific language and tone.
LSC Variation Algorithm: A novel mechanism for generating diverse outputs across length, style, and context dimensions for targeted audience engagement.
Reinforcement Learning Engagement Optimizer (RLEO): A blended reward function combining engagement score predictions (via XGBoost) and expert feedback, integrated with advanced decoding strategies (Tailored Beam Search, Enhanced Nucleus Sampling, and Contextual Temperature Scaling).
Together, these components enable RL-TweetGen to produce domain-aware, style-controllable, and engagement-optimized tweets, addressing a key gap in existing RL-NLP systems for digital commerce.
The remainder of this paper is organized as follows:
Section 2 reviews the literature on short-text generation, LLMs, reinforcement learning, and engagement optimization in digital commerce.
Section 3 details the RL-TweetGen methodology, including architecture, data curation, and model tuning.
Section 4 describes the experimental setup and reinforcement learning strategy.
Section 5 presents the evaluation results and analysis.
Section 6 presents representative use cases.
Section 7 concludes with key contributions and future directions.
2. Related Work
This section provides a comprehensive review of the key research areas that underpin the development of the proposed RL-TweetGen. It examines four interconnected domains: the evolution of short-text-generation techniques, the advancement and domain adaptation of large language models (LLMs), the application of reinforcement learning for goal-directed text generation, and the engagement dynamics driven by social media within the NFT ecosystem. Collectively, these areas establish both the technical foundation and the socio-commercial rationale for building an engagement-optimized, domain-aware tweet-generation system for digital commerce platforms.
2.1. Evolution of Short-Text Generation
Short-text generation has progressed from rule-based systems to sophisticated neural models capable of generating concise and contextually appropriate content. Early approaches relied on templates, statistical language models (e.g., n-grams), and Hidden Markov Models (HMMs). While these techniques produced grammatically valid sentences, they lacked semantic richness, adaptability, and user-centric tone—key requirements for micro-content on platforms like Twitter and Instagram.
The introduction of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks significantly improved sequence modeling and long-range dependency handling [9]. Yet, these models still exhibited limitations in capturing global context, especially for shorter, multi-intent messages. The advent of Transformer-based models marked a paradigm shift, enabling more robust contextual representation via self-attention mechanisms and parallel processing [10]. This breakthrough gave rise to general-purpose architectures that could be repurposed across NLP tasks, including summarization, question answering, and short-text generation.
Further innovations introduced controllability and semantic grounding into generation. Approaches like semantic-key-conditioned generation [11], style-conditioned decoding, and emotion-aware microtext generation [12,13] improved output alignment with user intent. Evolutionary algorithms and neuro-symbolic hybrids have also been employed to optimize generation strategies for domain-specific tasks, including low-resource language processing and context-sensitive advertising [14,15].
While transformer-based models have greatly improved the fluency and contextual relevance of short-text generation, they often lack mechanisms to align outputs with task-specific objectives such as engagement optimization. Prior work on style-conditioned and emotion-aware generation has shown that microtext can be tailored for sentiment or tone, but these methods typically overlook real-time user interaction data and commercial performance indicators. RL-TweetGen extends these advances by integrating predictive engagement feedback directly into the generation process, enabling the creation of concise, personalized, and engagement-driven content that fits within platform character limits while preserving semantic and emotional resonance.
2.2. Large Language Models and Domain Adaptation
Large Language Models (LLMs) such as OpenAI’s GPT, Meta’s LLaMA, Mistral AI’s Mistral, and DeepSeek have revolutionized natural language generation. Trained on massive web-scale corpora, these models achieve state-of-the-art performance in zero-shot and few-shot scenarios across diverse tasks. However, their generality can be a limitation: without domain-specific adaptation, outputs may exhibit hallucinations, lack context awareness, or deviate from user expectations in specialized settings such as NFTs or digital marketing [16,17].
To address these challenges, fine-tuning strategies have been developed to tailor pre-trained models to specific application areas. Instruction tuning aligns outputs with task phrasing, while supervised fine-tuning with labeled examples improves task consistency. Parameter-efficient approaches—such as Low-Rank Adaptation (LoRA) [18], adapters, and prompt tuning—reduce the computational and memory costs of adaptation, making customization feasible even on consumer-grade hardware [19,20]. Extensions like D2LoRA [21] further enhance efficiency in low-resource settings, enabling effective fine-tuning for domain-specific tasks such as title generation and healthcare communication.
Beyond supervised methods, alternative alignment techniques have emerged. Direct Preference Optimization (DPO) [22] offers a reinforcement-free alternative to RLHF, either standalone or integrated with LoRA, and has shown benefits in reducing bias and improving relevance in applications such as e-commerce search. In RL-TweetGen, LoRA-based fine-tuning is combined with domain-informed prompt engineering and reinforcement learning to inject NFT-specific terminology, hashtags, slang, and discourse patterns into the generation pipeline.
Evaluating domain-adapted LLMs requires more than surface-level lexical metrics. Semantic-aware measures such as BERTScore [23] and BLEURT [24] capture contextual fidelity, while traditional metrics like ROUGE-L remain useful for structural comparison. These automated measures are often complemented by human-in-the-loop assessments to ensure alignment with domain expectations and engagement objectives.
Despite these advances, several challenges remain, including overfitting, domain drift, and maintaining robustness under evolving platform norms. Continuous validation, real-time feedback integration, and iterative training loop optimization are essential for reliable deployment in dynamic environments like social media. Moreover, the ethical implications of optimizing content solely for engagement cannot be overlooked. Prior work in persuasive computing [25] and socio-technical AI design [26] highlights risks such as manipulative algorithmic behavior, bias reinforcement, and clickbait tendencies—underscoring the need for ethically grounded generative systems.
In summary, while existing LLMs offer powerful general-purpose generation capabilities, they often lack domain specificity in highly contextual environments like NFTs. Domain adaptation methods such as instruction tuning, LoRA, and adapters improve linguistic alignment but rarely address downstream impact on audience engagement. RL-TweetGen bridges this gap by combining domain-specific fine-tuning with engagement-aware reinforcement learning, ensuring both linguistic fidelity and measurable performance in social media marketing.
2.3. Reinforcement Learning for Controlled and Goal-Aware Generation
Reinforcement Learning (RL) has emerged as a powerful paradigm for aligning text generation with human-defined objectives such as factual accuracy, safety, persuasion, and user engagement. Unlike supervised fine-tuning, which focuses on replicating labeled examples, RL optimizes models based on outcome-driven feedback, enabling adaptive and goal-oriented content generation.
Reinforcement Learning from Human Feedback (RLHF) has been central to this evolution, using reward signals from preference comparisons, annotation scores, or proxy metrics (e.g., engagement statistics) to guide optimization [27]. This approach has been applied to improve helpfulness, reduce toxicity, and align tone with social norms. More recent methods, such as Direct Preference Optimization (DPO) and Reinforced Self-Training (ReST), reduce reliance on human labels by synthesizing high-quality training data from model outputs, thereby improving sample efficiency and scalability.
For short-text generation, RL supports optimization for complex, multi-objective goals—maximizing engagement (likes, retweets), adhering to platform constraints, and matching stylistic preferences. Custom reward functions, such as those used in RL-TweetGen, blend predictive engagement models (e.g., XGBoost-based scores) with domain expert evaluations to steer generation toward real-world effectiveness. However, challenges remain, including training instability, reward sparsity, credit assignment in short sequences, and the computational cost of exploration.
While prior RL-based text-generation systems have primarily targeted safety, helpfulness, or factuality, they often overlook domain-specific engagement factors and user behavior modeling. RL-TweetGen addresses this gap by defining a custom reward function grounded in actual Twitter engagement metrics (likes, retweets) and domain-specific sentiment signals, aligning the RL signal with real-world success criteria in NFT communities.
2.4. NFTs, Social Platforms, and Community Behavior
Non-Fungible Tokens (NFTs) have rapidly evolved from niche digital assets into a major pillar of the decentralized digital economy. Their use spans digital art, gaming, collectibles, identity, and metaverse assets. As NFTs gained traction, platforms like Twitter emerged as the central communication hub for creators, collectors, influencers, and marketplaces. These platforms are used not only for announcements and marketing but also for managing sentiment, building trust, and coordinating community behavior [28,29].
Studies show that tweet engagement (likes, comments, and retweets) directly affects NFT visibility and valuation. Emotions like excitement, trust, and curiosity are linked to increased user interaction and price fluctuations [28]. The social layer of NFTs—via shared memes, cultural references, and reply threads—forms a participatory ecosystem where community feedback loops shape success. However, these dynamics are vulnerable to manipulation through bot amplification, influencer-led hype cycles, and misinformation [30].
Transparency, authenticity, and ethical communication have therefore become critical for sustained community trust [31]. Despite this, most generative NLP systems lack the granularity to interpret or generate content that aligns with NFT-specific discourse and audience expectations. There is a growing need for domain-aware generative tools that adapt to evolving vocabulary, meme culture, and sentiment trends.
By using NFTs as a real-world evaluation domain, RL-TweetGen demonstrates how socio-technical alignment—between language models, social context, and platform norms—can improve automated communication in commerce. Tweets announcing drops, responding to followers, or celebrating milestones are not just content—they are signals in a trust-based market.
While prior studies have explored how social media sentiment influences NFT valuation [28,29], few have translated these insights into generative modeling techniques. Moreover, traditional content-generation tools fail to engage with the fast-paced, meme-driven, and community-centric nature of NFT discourse. RL-TweetGen directly responds to this gap by combining domain-adapted LLMs with real-time engagement signals, making it capable of producing socially resonant content that aligns with the cultural and emotional expectations of NFT communities. This socio-technical alignment is critical for trust-building and visibility in decentralized digital economies.
2.5. Toward an Integrated Framework for Engagement-Optimized Generation
While considerable progress has been made in short-text modeling, domain-specific fine-tuning, and RL-based control, most existing systems operate in silos. There is a lack of unified frameworks that combine all three elements in a scalable, modular, and socially grounded manner. Additionally, few systems incorporate engagement prediction directly into the learning loop, despite its clear importance in social commerce.
RL-TweetGen addresses this gap by integrating instruction-tuned LLMs, parameter-efficient domain adaptation, and reward-based optimization into a unified pipeline. Its evaluation within the NFT domain—characterized by emotional volatility, content saturation, and fast trend cycles—offers insights into how generative AI can be made responsive to both social and commercial imperatives. Existing research often treats text generation, domain adaptation, and engagement modeling as separate concerns. By contrast, RL-TweetGen integrates these three threads into a unified, modular architecture designed for high-impact content creation in dynamic commercial settings. It operationalizes insights from prior work—such as fine-tuning efficiency, reward-based optimization, and NFT community behavior—into a cohesive pipeline. This synthesis enables RL-TweetGen to generate not just grammatical or domain-correct content, but content that is strategically optimized for interaction, relevance, and ethical resonance within the social web.
The design of RL-TweetGen is grounded in four interrelated research areas, each contributing a critical dimension to its overall architecture. Advances in short-text generation provide the foundational modeling strategies for producing concise and coherent content. Building on this, developments in large language models (LLMs) and domain adaptation enable the deployment of instruction-tuned models tailored specifically for NFT-related discourse. Reinforcement learning frameworks introduce mechanisms for optimizing tweet generation based on engagement-oriented reward signals and expert feedback, aligning outputs with user interaction goals. Insights from the social dynamics of NFT communities further inform the system’s emphasis on stylistic variability, credibility markers, and audience sensitivity. These elements converge within a modular architecture designed to integrate and operationalize these research strands into a scalable, flexible, and engagement-aware framework.
More broadly, RL-TweetGen contributes to a growing movement toward responsible AI for digital commerce. By enabling controllable, engagement-driven, and ethically guided generation, the system advances the design of communication technologies that are not only linguistically fluent but also socially resonant and commercially effective.
3. Methodology
This section outlines the socio-technical architecture and implementation strategy of the proposed RL-TweetGen system—an integrated framework for generating engagement-optimized tweets in the NFT domain. The system is composed of multiple interlinked modules, including data collection, semantic classification, structured prompt generation, large language model (LLM) fine-tuning, decoding strategy selection, reinforcement learning-based optimization, and iterative output refinement. The complete system architecture is depicted in Figure 1.
3.1. Dataset Acquisition, Characteristics, and Preprocessing
The dataset used in this study was derived from the publicly available “Verified NFT Tweets” collection on Kaggle [32], spanning tweets posted between 2020 and 2022. It provides a rich corpus of NFT-related tweets accompanied by detailed metadata and engagement statistics. All data usage strictly complied with the dataset’s Creative Commons Attribution–NonCommercial–ShareAlike 4.0 International License (CC BY-NC-SA 4.0) and Twitter’s data redistribution policy. No data was collected via the Twitter API, and the dataset was used solely for non-commercial, academic research purposes. A curated subset of 2134 high-quality tweets was selected for analysis and model development.
Each tweet entry includes metadata such as tweet ID, timestamp, username, tweet content, language, mentioned users, hashtags, cashtags, quoted users, engagement indicators (likes, retweets, replies), and media presence flags. These features enable comprehensive semantic and contextual modeling, supporting both generation and engagement-scoring tasks.
3.1.1. Preprocessing Pipeline
A multi-stage preprocessing pipeline was implemented to ensure textual quality, linguistic uniformity, and format consistency:
Duplicate Removal: Identical tweets were removed to reduce redundancy and prevent data leakage.
Short Text Filtering: Tweets with fewer than 20 characters were excluded to eliminate low-information posts.
Language Filtering: The langid library retained only English-language tweets to ensure linguistic consistency.
Encoding Normalization: A Latin-1 to UTF-8 fallback decoding procedure was applied to resolve encoding artifacts and recover any corrupted text. The conversion re-encodes tweet content into UTF-8, ensuring compatibility with modern NLP pipelines that require UTF-8 input.
Character Cleaning: Non-ASCII and corrupted characters were removed using regular expressions to preserve tokenizability.
Whitespace and Punctuation Normalization: Irregular spacing and malformed punctuation were standardized to improve parsing and formatting.
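For illustration, the pipeline above can be sketched in a few lines of Python; the content column name and dataset schema are assumptions, while the 20-character threshold, langid filter, and Latin-1 to UTF-8 fallback follow the steps listed above.

```python
import re
import langid
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Duplicate removal and short-text filtering (< 20 characters)
    df = df.drop_duplicates(subset="content")
    df = df[df["content"].str.len() >= 20]
    # Language filtering: keep English tweets only
    df = df[df["content"].apply(lambda t: langid.classify(t)[0] == "en")]

    def clean(text: str) -> str:
        # Latin-1 -> UTF-8 fallback decoding to repair encoding artifacts
        try:
            text = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            pass
        # Strip non-ASCII characters, then normalize whitespace
        text = re.sub(r"[^\x00-\x7F]+", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    df["content"] = df["content"].apply(clean)
    return df
```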
3.1.2. Dataset Statistics
The step-by-step preprocessing results are summarized in Table 1.
The resulting dataset constitutes a semantically coherent, structurally clean, and contextually diverse corpus of NFT-specific tweets, suitable for training both generative language models and engagement-prediction modules. The engagement-level stratification ensures balanced representation across low-, medium-, and high-engagement tweets, enabling fair evaluation of the proposed RL-TweetGen framework.
3.2. Engagement Score Computation and Labeling
To quantify and categorize tweet-level user engagement, a composite engagement score was computed based on user interaction metrics. The scoring formula, presented in Equation (1), assigns differential weights to likes, retweets, and replies—reflecting the relative depth of interaction, where replies are deemed most indicative of meaningful engagement:

$$E = 1 \cdot \text{Likes} + 2 \cdot \text{Retweets} + 3 \cdot \text{Replies} \quad (1)$$

The weights 1, 2, and 3 in Equation (1) represent a hierarchy of user engagement: likes denote passive approval, retweets reflect content amplification, and replies suggest deeper interaction. This heuristic captures the increasing effort and intent behind each action.
Following score computation, a quantile-based binning strategy was applied to discretize engagement values into three balanced classes: Low, Medium, and High. This stratification enables effective training of classification models and facilitates multi-class evaluation.
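A minimal sketch of the scoring and binning steps follows; the column names are assumptions, and ranking before pd.qcut is one way to guarantee three balanced classes when scores tie.

```python
import pandas as pd

def add_engagement_labels(df: pd.DataFrame) -> pd.DataFrame:
    # Equation (1): replies carry the highest weight as the deepest interaction
    df["engagement_score"] = df["likes"] + 2 * df["retweets"] + 3 * df["replies"]
    # Quantile-based binning into three balanced classes
    ranks = df["engagement_score"].rank(method="first")  # break ties for clean quantile edges
    df["engagement_label"] = pd.qcut(ranks, q=3, labels=["Low", "Medium", "High"])
    return df
```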
The full preprocessing and labeling workflow is outlined in Algorithm 1.
This preprocessing and engagement annotation procedure ensures clean, structured input data for model fine-tuning and facilitates reliable performance evaluation under a supervised learning framework.
Algorithm 1 Data preprocessing and engagement scoring
1: Input: Raw_Tweets
2: Output: Cleaned_Tweets with Engagement Labels
3: Load raw dataset
4: for each tweet t in Raw_Tweets do
5:   if length(t) < 20 or t is a duplicate then
6:     Remove t
7:   else
8:     Detect language using langid; retain only English tweets
9:     Normalize encoding from Latin-1 to UTF-8
10:    Remove non-ASCII characters using regular expressions
11:    Normalize whitespace and punctuation
12:  end if
13: end for
14: for each tweet t in cleaned set do
15:   Compute Engagement Score: $E(t) = 1 \cdot \text{Likes} + 2 \cdot \text{Retweets} + 3 \cdot \text{Replies}$
16: end for
17: Apply quantile-based binning to all E
18: Assign label: Low, Medium, or High
3.3. Tweet Semantic Classification
To enable domain-aware NFT tweet generation, a semantic classification pipeline was designed to categorize tweets into six predefined domains: Art, Gaming, Music, Photography, Membership, and Profile Pictures (PFPs).
Annotation and Feature Extraction: Tweets were annotated using a rule-based strategy combining keyword matching with regex-based pattern detection to extract domain indicative cues. This heuristic labeling approach produced a weakly supervised training set aligned with the semantic content of each domain. For feature extraction, TF-IDF vectorization with bigrams was applied to capture discriminative token patterns relevant to NFT subject areas.
Model Training: The annotated dataset was used to train and evaluate a suite of traditional machine learning classifiers [33], including Logistic Regression, Linear Support Vector Machine (SVM), Stochastic Gradient Descent (SGD) Classifier, Decision Tree, Random Forest, and Gradient Boosting.
Evaluation Strategy: Model training and evaluation were performed with stratified data splits to maintain balanced label distributions across semantic categories. Standard classification metrics were used to assess performance.
Model Selection and Prompt Construction: Among the evaluated models, the Gradient Boosting classifier achieved the highest validation accuracy (see Section 5.1 for detailed results) and was selected to generate semantic labels for guiding the construction of prompts. The output of this classification pipeline consisted of structured semantic labels, which were subsequently integrated into prompt templates to ensure that generated NFT tweets reflected coherent, domain-specific content.
3.4. Intent-Aware Prompt Generation
To generate relevant, coherent, and intent-aligned NFT tweets, a structured prompt-generation module was developed. This module transformed raw tweet text into natural language prompts that conditioned the generative model on both the semantic category and the extracted textual components of each tweet. The process began with a dataset of pre-labeled tweets, each annotated with one of six semantic intent categories—Art, Gaming, Music, Photography, Membership, or Profile Pictures (PFPs)—as determined by the semantic classification pipeline. These semantic intent labels acted as guiding signals for prompt construction, enabling intent-aware generation where the communicative goal of the tweet directly influenced the prompt’s content and structure. Each tweet underwent linguistic preprocessing using NLTK and spaCy to extract structured features:
Hashtag, Mention, and URL Extraction: Up to five hashtags, user mentions, and URLs were extracted using regular expressions.
Named Entity Recognition (NER): Named entities (e.g., organizations, people, products, locations) were identified using spaCy’s en_core_web_sm Named Entity Recognition model.
Keyword Extraction: Content-bearing keywords were extracted through tokenization, stopword removal, and frequency-based filtering.
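A compact sketch of the extraction steps just listed is shown below; the regular expressions are illustrative, while the spaCy en_core_web_sm model follows the description above.

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_prompt_features(tweet: str) -> dict:
    doc = nlp(tweet)
    return {
        "hashtags": re.findall(r"#\w+", tweet)[:5],            # up to five each
        "mentions": re.findall(r"@\w+", tweet)[:5],
        "urls": re.findall(r"https?://\S+", tweet)[:5],
        "entities": [(ent.text, ent.label_) for ent in doc.ents],  # spaCy NER
        "keywords": [t.text.lower() for t in doc
                     if t.is_alpha and not t.is_stop],          # stopword-filtered tokens
    }
```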
Based on each tweet’s assigned semantic category, the module mapped it to a predefined intent-specific template encoding the communicative function of the tweet. These intent-aware templates were dynamically populated with the extracted features to construct complete, interpretable prompts, ensuring that the generated tweets clearly reflected the semantic purpose of the original content. The entire prompt-construction process was implemented in batch using a CSV-based pipeline, generating a parallel dataset of structured prompts paired with their source tweets.
These prompt–tweet pairs were subsequently used as inputs for the RL-TweetGen System, which combined supervised fine-tuning with simulated reinforcement learning to train a generative model capable of producing tweets that are semantically grounded, intent-aligned, and optimized for engagement. Algorithm 2 outlines the proposed semantic and intent-aware prompt-generation process, enabling tailored input construction for NFT tweet generation.
Algorithm 2 Proposed semantic and intent-aware prompt generation
1: Input: Cleaned_Tweet_Dataset
2: Output: Final_Prompt_Dataset
3: Semantic Annotation and Classification
4: Annotate each tweet with domain labels using keyword heuristics
5: Extract TF-IDF features with bigrams
6: Train classifiers: Logistic Regression, SVM, SGD, Decision Tree, Random Forest, Gradient Boosting
7: Evaluate each classifier using stratified split and metrics: accuracy, precision, recall, F1
8: Select best-performing classifier (Gradient Boosting)
9: Predict semantic label: {Art, Gaming, Music, Photography, Membership, PFPs}
10: Prompt Feature Extraction
11: for each tweet in dataset do
12:   Extract structural elements: hashtags, mentions, and URLs (up to 5)
13:   Apply Named Entity Recognition (NER) using spaCy
14:   Extract key tokens via tokenization and frequency filtering
15:   Retrieve semantic label and compute engagement level
16:   Select intent-specific prompt template based on semantic label
17:   Populate prompt with extracted features and labels
18: end for
19: Save constructed prompts to Final_Prompt_Dataset
3.5. Model Selection Overview
To establish a strong foundation and performance baseline for NFT tweet generation, three state-of-the-art instruction-tuned language models were selected: Mistral-7B Instruct v0.1, LLaMA-3.1-8B Instruct, and DeepSeek LLM 7B Chat. These models were chosen for their robust instruction-following capabilities, efficient low-latency generation, and adaptability to domain-specific tasks—key requirements for producing short-form, high-impact content such as NFT-related tweets. A comparative summary of their architectural characteristics, model capacities, and instruction-following strengths is presented in Table 2. This comparison underscores each model’s suitability for specific aspects of NFT tweet generation—such as contextual reasoning, sampling efficiency, and compatibility with LoRA-based fine-tuning.
The following sections provide deeper insights into each model’s architecture and its role within the proposed RL-TweetGen framework, highlighting how their distinct capabilities contribute to stylistically consistent, semantically grounded, and engagement-optimized tweet generation.
3.5.1. Mistral-7B Instruct V0.1
Mistral-7B Instruct v0.1, developed by Mistral AI, is a 7-billion-parameter decoder-only transformer model optimized for fast and memory-efficient inference. Its low-latency generation capabilities make it particularly well suited for time-sensitive tasks such as NFT tweet generation, where rapid content creation and iteration are critical.
Architecture: Mistral-7B incorporates architectural innovations including Grouped Query Attention (GQA), Sliding Window Attention (SWA), and FlashAttention v2, which collectively enhance sequence processing efficiency and reduce computational overhead. These features enable the model to effectively handle short-form, contextually rich text generation while maintaining high throughput and responsiveness.
Key Features and Role in RL-TweetGen: Mistral-7B is designed to support generation of stylistically diverse outputs with minimal latency. Its efficient architecture allows for quick generation of multiple tweet variants, making it ideal for live updates, community engagement, and A/B testing in NFT marketing workflows. Within the RL-TweetGen system, these capabilities facilitate rapid experimentation and iteration, ensuring tweets are not only timely but also aligned with current trends and audience expectations. Its balance of speed, efficiency, and coherence positions Mistral-7B as a strong candidate for applications requiring high-frequency, high-quality text outputs.
3.5.2. LLaMA-3.1-8B Instruct
LLaMA-3.1-8B Instruct, developed by Meta, is an 8-billion-parameter decoder-only transformer model optimized for instruction following and extended context handling. It is well suited for NFT tweet-generation tasks that require consistency, contextual awareness, and precise control over tone and structure.
Architecture: The model builds on the transformer architecture and incorporates extended Rotary Positional Embeddings (RoPE) and sliding window attention, allowing it to process longer and more structured inputs. These capabilities are particularly useful for generating sequenced or grouped tweets—such as mint announcements, update threads, or promotional campaigns. Additional features include layer normalization, instruction tuning, and mixed-precision training, along with full LoRA compatibility to support efficient domain-specific fine-tuning.
Key Features and Role in RL-TweetGen: LLaMA-3.1-8B excels in generating contextually rich, semantically coherent, and style-consistent tweets. Its support for long prompts and structured outputs enables it to maintain alignment across multiple tweet variants or parts of a campaign. Within the RL-TweetGen system, LLaMA is especially effective in tasks requiring expressive flexibility, semantic preservation, and uniform tone, making it ideal for generating tweets related to NFT project updates, event promotions, and community engagement.
3.5.3. DeepSeek LLM 7B Chat
DeepSeek LLM 7B Chat, developed by DeepSeek AI, is a 7-billion-parameter decoder-only transformer designed with a focus on extensibility, lightweight deployment, and real-time performance. It supports efficient adaptation to domain-specific tasks, such as NFT tweet generation, through instruction tuning and modular fine-tuning techniques.
Architecture: The model employs Rotary Positional Embeddings and Pre-Layer Normalization, which enhance training stability and decoding efficiency. Its architecture is optimized for token-level performance and is fully compatible with Low-Rank Adaptation (LoRA), allowing for rapid fine-tuning to match domain-specific tone, vocabulary, and structure.
Key Features and Role in RL-TweetGen: DeepSeek 7B Chat offers robust instruction-following capability and is optimized for generating concise, clear, and domain-aligned outputs. Its design emphasizes scalability and responsiveness, making it suitable for narrow-domain tweet-generation tasks such as reusable NFT templates, collection highlights, FAQs, and community engagement messages. Within RL-TweetGen, DeepSeek functions as a lightweight and adaptable model for automated NFT tweet-generation tasks.
3.6. Baseline Generation (Zero-Shot LLM)
To establish a performance benchmark, a baseline tweet-generation module was implemented using zero-shot prompting on state-of-the-art open-weight large language models (LLMs), specifically LLaMA-3.1-8B Instruct, Mistral-7B Instruct v0.1, and DeepSeek LLM 7B Chat. This component evaluated the capacity of pretrained, unadapted models to generate coherent NFT-related tweets from structured prompts without any domain-specific fine-tuning. The procedure was as follows:
Data Preparation: Prompts were extracted from the structured prompt dataset produced by the prompt-generation module.
Model Configuration: Each LLM was loaded using the Hugging Face Transformers library with quantized 4-bit weights (BitsAndBytes) for efficient inference on available GPUs. Sampling parameters were configured to encourage diverse but controlled outputs, including settings for maximum token-generation length, temperature-based sampling, nucleus sampling, and repetition penalties.
Prompt Formatting: Prompts were wrapped in a structured instruction template aligned with the Instruct format to encourage concise, engaging, and professional tweet generation.
Generation and Postprocessing: Each prompt was fed into the model’s text-generation pipeline to produce candidate tweets. The generated outputs were truncated to Twitter’s 280-character limit and filtered for minimum length to discard failures or incoherent generations.
This baseline zero-shot generation step provided a reference point for evaluating the effectiveness of supervised fine-tuning and reinforcement learning in subsequent stages of the RL-TweetGen system.
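A hedged sketch of this baseline step using the Hugging Face Transformers and BitsAndBytes libraries follows; the prompt wording is illustrative, while the decoding values mirror the configuration reported in Section 4.3.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"  # 4-bit weights for GPU economy
)

prompt = "[INST] Write a concise, engaging, professional tweet announcing an NFT art drop. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs, max_new_tokens=80, do_sample=True,
    temperature=0.9, top_p=0.7, repetition_penalty=1.1,
)
tweet = tokenizer.decode(
    out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[:280]  # truncate to Twitter's character limit
```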
3.7. Supervised Model Fine-Tuning Using PEFT with LoRA
To adapt large language models (LLMs) for the domain-specific task of NFT tweet generation, we employ Parameter-Efficient Fine-Tuning (PEFT) using the Low-Rank Adaptation (LoRA) technique. This method introduces a minimal number of additional trainable parameters into the model, significantly reducing computational and memory requirements during fine-tuning, while keeping the pretrained weights frozen.
LoRA approximates the full-rank weight update $\Delta W$ using a low-rank decomposition, as shown in Equation (2):

$$\Delta W = BA \quad (2)$$

where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and $r \ll \min(d, k)$. The matrices $A$ and $B$ are the only trainable parameters introduced during fine-tuning.

Equation (2) represents the low-rank factorization of the weight update matrix $\Delta W$ into two smaller matrices. This is grounded in linear algebra: a high-rank matrix can often be closely approximated by the product of two low-rank matrices. The factorization reduces the number of trainable parameters from $d \times k$ to $r(d + k)$, enabling more efficient optimization.
The final adapted weight matrix is obtained by adding the low-rank update to the frozen pretrained weight matrix, as defined in Equation (3):

$$W' = W + \Delta W = W + BA \quad (3)$$

Equation (3) shows how the low-rank update $\Delta W = BA$ is applied to the original weight matrix $W$. From an engineering perspective, this formulation enables fine-tuning with minimal additional memory and computation, as only $A$ and $B$ are updated while the original weights remain unchanged. This approach supports scalable, task-specific adaptation without degrading the generalization ability of the base model.
This design preserves the generalization capabilities of the base model while introducing sufficient flexibility to support efficient domain adaptation for NFT-specific linguistic patterns and engagement styles.
Figure 2 illustrates the LoRA-based Parameter-Efficient Fine-Tuning (PEFT) strategy, which adapts pretrained language models to the domain-specific task of NFT tweet generation by introducing low-rank updates while keeping the base model weights frozen.
3.7.1. Base Model and Tokenizer Initialization
The fine-tuning process builds upon the zero-shot generation setup described earlier in Section 3.6. We initialize three pretrained, instruction-tuned, decoder-only transformer models: LLaMA, Mistral, and DeepSeek. These models are specifically designed for instruction-following tasks and possess the architectural capacity to generate high-quality, coherent, and context-aware NFT-related tweets.
Each model is coupled with its corresponding tokenizer, configured with consistent settings for padding, truncation, and attention masking. These configurations ensure uniform input sequence lengths across training batches, contributing to training stability and facilitating efficient inference during generation. This initialization step forms the foundation for subsequent domain-specific adaptation using parameter-efficient fine-tuning strategies.
3.7.2. Dataset Loading and Preprocessing
A curated dataset of NFT-related tweets is used for supervised fine-tuning. Each tweet is first tokenized using the corresponding tokenizer of the selected base model. The tokenized sequences are then truncated and padded to a fixed maximum length (typically 64 tokens) to ensure uniform input dimensions across batches.
The data is structured for Causal Language Modeling (CLM), where the objective is to predict the next token given all previous tokens. This autoregressive setup aligns naturally with the task of tweet generation, enabling the model to learn the sequential dependencies necessary for producing fluent and context-aware NFT tweets.
3.7.3. Base Models and Checkpoints
We fine-tuned the following frozen base model checkpoints using parameter-efficient LoRA adapters:
LLaMA-7B: meta-llama/Llama-2-7b-hf
Mistral-7B-Instruct: mistralai/Mistral-7B-Instruct-v0.2
DeepSeek-7B-Base: deepseek-ai/deepseek-llm-7b-base
3.7.4. LoRA Configuration and Target Modules
To enable parameter-efficient fine-tuning, LoRA adapters were injected into the attention projection layers of each transformer block. Only the LoRA parameters were updated; all pretrained model weights remained frozen.
Targets (LLaMA and DeepSeek): q_proj, v_proj
Targets (Mistral): q_proj, k_proj, v_proj, o_proj
LoRA Hyperparameters: rank $r$, scaling factor $\alpha$, and dropout
This configuration preserves the generalization capabilities of the base models while reducing GPU memory usage and enabling domain-specific adaptation for NFT tweet generation.
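The injection can be expressed with the Hugging Face PEFT library as sketched below; the rank, alpha, and dropout values are placeholders, since the exact numbers are not reproduced in this section.

```python
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,      # illustrative values
    target_modules=["q_proj", "v_proj"],        # LLaMA / DeepSeek targets
    # Mistral: ["q_proj", "k_proj", "v_proj", "o_proj"]
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_cfg)    # base_model: frozen pretrained LLM
model.print_trainable_parameters()              # only the LoRA A, B matrices train
```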
3.7.5. Training Configuration and Steps
Fine-tuning was implemented with the Hugging Face PEFT library under the Causal Language Modeling (CLM) objective, where the model learns to predict the next token by minimizing the negative log-likelihood loss.
Optimizer/Scheduler: AdamW, weight decay 0.01, linear warmup ratio 0.03;
Learning rate: 5 × 10−5;
Precision: Mixed FP16;
Batching: per-device batch size = 8; gradient accumulation steps = 4 (effective batch = 32);
Sequence length: T = 256 tokens with dynamic padding;
Max training steps: 1800 with early stopping after 3 consecutive evaluations without validation improvement;
Evaluation/Save frequency: every 100 steps; best checkpoint selected by validation perplexity.
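These settings map onto Hugging Face TrainingArguments roughly as follows; early stopping would be attached via the standard EarlyStoppingCallback with patience 3.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="rl-tweetgen-lora",
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_ratio=0.03,                  # linear warmup (default linear scheduler)
    fp16=True,                          # mixed-precision training
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,      # effective batch size = 32
    max_steps=1800,
    evaluation_strategy="steps",
    eval_steps=100,
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",  # proxy for validation perplexity
)
```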
3.7.6. Checkpointing and Artifacts
A structured checkpointing mechanism was implemented to ensure reliability, reproducibility, and rollback capability. Checkpoints were saved at multiple stages (e.g., steps 20, 1720, 1722) to capture both early and late training progress.
Saved: LoRA adapter weights, optimizer/scheduler states, random number generator state, tokenizer configuration and vocabulary, training arguments (including precision), and prompt template configuration.
Not saved: frozen base model weights (linked from the original checkpoint during loading).
Only LoRA adapter weights were stored to minimize storage overhead while retaining compatibility with the frozen base model.
This modular checkpointing strategy enables seamless recovery, reproducibility, and model portability while reducing resource usage—making it ideal for iterative, instruction-tuned fine-tuning pipelines.
Figure 3, Figure 4 and Figure 5 visualize key fine-tuning indicators across the Mistral, LLaMA, and DeepSeek models, including learning-rate evolution, gradient-norm fluctuations, and loss reduction over training steps.
3.7.7. Inference and NFT Tweet Generation
Following fine-tuning, the trained model is reloaded alongside its tokenizer and LoRA adapter modules for inference. Structured prompts (as in Section 3.4) are used as inputs to guide generation in an intent-aware, domain-specific manner.
To balance creativity and coherence during generation, we employ top-p (nucleus) sampling with a fixed threshold $p$, which restricts sampling to the smallest set of tokens whose cumulative probability exceeds that threshold. This decoding strategy introduces controlled diversity while ensuring that the output remains fluent and contextually appropriate.
The resulting NFT tweets are concise, semantically aligned, and reflect the linguistic tone, vocabulary, and communicative intent expected within NFT-related discourse, spanning various categories.
3.8. Diverse Generation with LSC Variation
To ensure that the generated NFT-related tweets are diverse, engaging, and tailored to a broad range of audience segments, we introduce the LSC Diverse Generation algorithm—where LSC refers to Length, Style, and Context. This method systematically generates tweet variations by iterating over all possible combinations of:
Length: short, medium, long;
Style: formal, informal, excited, casual, professional, friendly, humorous;
Context: art, investment, technology, community, gaming, collectibles, fashion, music.
For each unique combination, a structured prompt is dynamically constructed. This prompt integrates key elements such as context-specific hooks, relevant hashtags, strategically placed emojis, and an appropriate call-to-action (CTA) phrase—each tailored to reflect the intended length, stylistic tone, and thematic focus.
To optimize decoding performance, we empirically tuned the temperature and top-p parameters across multiple validation runs. The selected values—temperature = 0.9 and top-p = 0.7 (nucleus sampling)—consistently produced high-quality, engaging, and coherent NFT-related tweets, balancing creativity and relevance. Emojis were incorporated to enhance emotional resonance and stylistic expression in alignment with the tone specified in the prompt.
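The enumeration itself is straightforward, as sketched below; the prompt template wording is an illustrative assumption.

```python
from itertools import product

LENGTHS = ["short", "medium", "long"]
STYLES = ["formal", "informal", "excited", "casual",
          "professional", "friendly", "humorous"]
CONTEXTS = ["art", "investment", "technology", "community",
            "gaming", "collectibles", "fashion", "music"]

def lsc_prompts(topic: str):
    # 3 x 7 x 8 = 168 structured prompt variants per topic
    for length, style, context in product(LENGTHS, STYLES, CONTEXTS):
        yield (
            f"Write a {length}, {style} tweet about {topic} for the NFT "
            f"{context} audience. Include a relevant hashtag, a fitting "
            f"emoji, and a call-to-action."
        )
# Each prompt is then decoded with temperature = 0.9 and top-p = 0.7, as noted above.
```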
This LSC-driven approach promotes extensive linguistic and contextual coverage, resulting in tweet outputs that are not only coherent and fluent, but also stylistically diverse, semantically relevant, and well-aligned with varied NFT audiences. As detailed in Algorithm 3, the tweet-generation process leverages supervised fine-tuning with LoRA (PEFT) and incorporates Length, Style, and Context (LSC) variations to enhance output diversity, coherence, and relevance for NFT-specific communication.
Algorithm 3 Tweet generation using supervised fine-tuning with LoRA (PEFT) and LSC variation
1: Input: Structured Prompt Dataset
2: Output: Diverse NFT Tweets
3: Supervised Fine-Tuning using LoRA (PEFT)
4: Initialize pretrained decoder-only LLMs and tokenizer
5: Format data for Causal Language Modeling (CLM) with input–output prompt pairs
6: Tokenize sequences with padding and truncation
7: Inject LoRA adapters into attention layers: $W' = W + BA$
8: Freeze base weights and train only LoRA parameters
9: Set training configuration
10: Optimize model using CLM loss (next-token prediction)
11: Reload model with trained LoRA adapters
12: Generate tweets from fine-tuned model using top-p sampling
13: LSC-Based Diverse Generation
14: Define generation dimensions:
    Length: short, medium, long
    Style: formal, informal, excited, casual, professional, friendly, humorous
    Context: art, investment, technology, community, gaming, collectibles, fashion, music
15: for each LSC combination do
16:   Construct prompt using selected Length, Style, and Context
17:   Generate tweet using decoding parameters: temperature = 0.9, top-p = 0.7
18: end for
3.9. Engagement Scoring Model
To estimate the potential engagement of generated NFT tweets, we employ an XGBoost-based regression model trained on historical NFT-related tweet data. The model integrates both textual and structural features to predict an engagement score S, which is later incorporated into the reinforcement learning reward function.
Feature Set: The input features used for training include:
Tweet Length: Total number of characters.
Word Count: Total number of words.
Presence of Emojis and Hashtags: Binary indicators capturing stylistic and topical emphasis.
Textual Features: TF-IDF vectorized representations of tweet content, including unigrams and bigrams.
Preprocessing Pipeline: The modeling pipeline begins with a TF-IDF Vectorizer to convert raw tweet text into a sparse, weighted vector format based on token frequency and importance. Simultaneously, numerical features (length, word count, emoji/hashtag presence, sentiment polarity, POS proportions) are normalized using a StandardScaler to ensure uniform scaling. Both branches—TF-IDF features and normalized numerical features—are fused via a ColumnTransformer, producing a unified feature matrix compatible with the XGBoost regressor.
Model Training and Output: The final feature matrix is used to train an XGBoost regression model, which learns to predict the engagement score S for each tweet. A log1p transformation is applied to the raw engagement counts (likes, retweets, replies) to mitigate skewness caused by outliers. The predicted engagement score serves as a quantitative signal in the blended reward function during reinforcement learning, guiding generation toward content with higher potential for audience interaction.
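The pipeline can be sketched as follows; the column names are assumptions, and the XGBoost hyperparameters are left at library defaults rather than the tuned values.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

numeric_cols = ["length", "word_count", "has_emoji", "has_hashtag"]  # assumed names
features = ColumnTransformer([
    ("text", TfidfVectorizer(ngram_range=(1, 2)), "content"),  # unigrams + bigrams
    ("num", StandardScaler(), numeric_cols),                   # normalized numerics
])
model = Pipeline([("features", features), ("xgb", XGBRegressor())])

# log1p on raw counts damps the skew caused by viral outliers
y_train = np.log1p(df_train["likes"] + df_train["retweets"] + df_train["replies"])
model.fit(df_train, y_train)
scores = model.predict(df_test)  # predicted engagement score S
```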
Model Performance: The engagement predictor achieved an $R^2$ score of 0.823 and an RMSE of 0.041 (normalized scale) on the held-out test set. Feature importance rankings, derived from XGBoost’s gain-based metric, and performance metrics are summarized in Table 3.
By combining textual semantics with structural indicators, the engagement scoring model enables RL-TweetGen to prioritize content that balances domain relevance, stylistic appeal, and high predicted engagement.
3.10. Reinforcement Learning Algorithm for Engagement-Optimized Tweet Generation
In this module, a novel reinforcement learning algorithm is proposed, specifically designed to optimize short-text generation—such as tweets—for social media platforms, with a focus on the NFT ecosystem. This algorithm uniquely integrates predicted engagement modeling, human-in-the-loop feedback, advantage-based learning dynamics, and context-aware decoding strategies to maximize both quality and audience engagement in generated tweets. As presented in Algorithm 4, the proposed reinforcement learning strategy optimizes NFT tweet generation by maximizing engagement-driven rewards, aligning outputs with audience preferences, emotional tone, and community-specific language.
3.10.1. Reinforcement Learning Optimization
After supervised fine-tuning, we applied Proximal Policy Optimization (PPO) [34] to optimize tweet generation for engagement signals (likes, retweets, replies).
Hybrid Reward Function: The reward blends three min–max normalized predictions—for likes, retweets, and replies—produced by a reward model trained on historical tweet engagement data.
Key Training Parameters:
RL algorithm: PPO;
Learning rate: 5 × 10−5;
Batch size: 64 episodes;
PPO epochs: 5 per update;
Clipping range: 0.2;
Entropy coefficient: 0.01;
Early stopping: 3 epochs without reward improvement.
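For reference, the clipped surrogate objective that PPO maximizes can be written in a few lines; this is the standard formulation with the clipping range of 0.2 listed above, not code from the RL-TweetGen implementation.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantage, eps=0.2):
    # Importance ratio between the updated policy and the sampling policy
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    # Negative sign: optimizers minimize, while PPO maximizes the surrogate
    return -torch.min(unclipped, clipped).mean()
```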
Algorithm 4 Proposed reinforcement learning for engagement-optimized NFT tweet generation
1: Input: Structured prompt dataset, pretrained LLM, reward components
2: Output: High-engagement NFT tweets
3: Tweet-Generation Environment
4: Load models
5: Set max tweet length (280 characters)
6: Generate tweet candidates from structured prompts with CTAs, emojis, hashtags
7: Reward Computation
8: Predict engagement score S using XGBoost
9: Collect human feedback (1–5 scale), normalize as F = Feedback/5
10: Compute blended reward R from S and F (Equation (5))
11: Advantage Calculation and Policy Update
12: Compute advantage: A = R − Baseline
13: if A > 0 then
14:   Increase temperature T (exploration)
15: else
16:   Decrease temperature T (precision)
17: end if
18: Engagement-Aware Decoding Strategies
19: Tailored Beam Search (TBS): rerank candidates by summed token-level scores using keyword and intent relevance
20: Enhanced Nucleus Sampling (Top-p): select tokens where cumulative probability exceeds p
21: Contextual Temperature Scaling (CTS): sample from the temperature-scaled softmax, with T adjusted by context and A
22: Normalize expert feedback and update reward for next generation cycle
3.10.2. Tweet-Generation Environment
The algorithm operates in a simulated tweet-generation environment, where an instruction-tuned language model (LLaMA 3.1-8B, Mistral 7B, DeepSeek 7B) produces candidate tweets in response to structured prompts. These tweets are constrained to 280 characters and optimized for NFT marketing, incorporating elements such as calls-to-action (CTAs), emojis, and topic-specific hashtags (e.g., #NFTArt, #CryptoCollectibles).
3.10.3. Blended Reward Function
Each generated tweet is evaluated using a hybrid reward signal combining both predicted and qualitative feedback, as defined in Equation (5), which blends the predicted engagement score $S$ with a normalized expert feedback score $F$. Here, $S$ is the predicted engagement score from an XGBoost model trained on historical NFT tweets using features such as keyword frequency, sentiment scores, punctuation usage, and historical engagement metrics, and the Feedback Score is a human rating on a 1–5 scale.
This hybrid reward strikes a balance between quantitative virality and linguistic and stylistic resonance for the NFT audience.
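A minimal sketch of this blending is given below; the mixing weight lam is a placeholder, since the exact value is not reproduced here.

```python
def blended_reward(pred_engagement: float, feedback_1_to_5: float,
                   lam: float = 0.7) -> float:  # lam is an assumed weight
    s = pred_engagement          # XGBoost-predicted score S, min-max normalized
    f = feedback_1_to_5 / 5.0    # expert rating normalized to [0, 1]
    return lam * s + (1 - lam) * f
```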
3.10.4. Advantage-Based Policy Update
To guide reinforcement learning and dynamically adapt generation strategies, the algorithm computes an advantage value, as defined in Equation (6):

$$A = R - \text{Baseline} \quad (6)$$

where the Baseline is the moving average of recent reward scores, serving as a dynamic performance reference.
Based on the computed advantage, the system adjusts the generation temperature ($T$) as follows:
If $A > 0$, increase $T$ → encourages exploration and creative diversity.
If $A \le 0$, decrease $T$ → promotes precision and adherence to proven tweet patterns.
This dynamic adjustment mechanism allows the algorithm to fine-tune its balance between exploration and exploitation.
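The update rule can be sketched as follows; the step size and temperature bounds are illustrative choices, not reported values.

```python
def update_temperature(T, reward, baseline, step=0.05, lo=0.5, hi=1.5):
    advantage = reward - baseline        # Equation (6)
    if advantage > 0:
        T = min(hi, T + step)            # explore: more creative diversity
    else:
        T = max(lo, T - step)            # exploit: stick to proven patterns
    return T, advantage
```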
3.10.5. Engagement-Aware Decoding Strategies
To improve the quality, diversity, and contextual alignment of generated tweets, the proposed algorithm integrates three advanced decoding strategies:
Tailored Beam Search (TBS): An enhanced version of standard beam search that incorporates domain-specific heuristics for output reranking. Candidate sequences are evaluated based on factors such as keyword relevance, intent alignment, and domain-specific embedding similarity. The score of a candidate sequence is computed by summing token-level contributions, as defined in Equation (7):

$$\text{Score}(y) = \sum_{t=1}^{|y|} s(y_t) \quad (7)$$

where each token-level contribution $s(y_t)$ combines the model likelihood with the relevance heuristics above.

Enhanced Top-p (Nucleus) Sampling: A controlled sampling strategy that selects the next token from the smallest set of tokens whose cumulative probability exceeds a predefined threshold $p$. This encourages stylistic diversity and creativity while maintaining fluency, as shown in Equation (8):

$$V_p = \min\Big\{ V' \subseteq V : \sum_{y \in V'} P(y \mid y_{<t}) \ge p \Big\} \quad (8)$$

Contextual Temperature Scaling (CTS): This method dynamically adjusts the softmax temperature based on the thematic context of the tweet. The temperature $T$ is initialized per NFT topic and updated online using the computed advantage signal. The probability distribution over tokens is computed as shown in Equation (9):

$$P(y_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \quad (9)$$
3.10.6. Evaluation and Iteration
A continuous feedback loop evaluates each generated tweet along four dimensions: contextual relevance, engagement potential, clarity and coherence, and stylistic quality. This process is driven by expert-based evaluation, where scores are assigned on a 5-point Likert scale, normalized to the range [0, 1], and integrated into the reward function to refine the reinforcement learning process. To benchmark model performance, tweet outputs from each language model (LLaMA, Mistral, and DeepSeek) are independently assessed through domain-informed expert judgment, with model identities withheld to ensure unbiased scoring. A weighted average of the expert scores, with greater emphasis on engagement potential, reflects the practical priorities of NFT-focused content generation.
4. Experimental Procedure
This section outlines the experimental setup used to evaluate the RL-TweetGen system. It includes details on system configuration, dataset preparation, model configuration, and the design of a simulated reinforcement learning environment. These components collectively ensure a comprehensive and rigorous evaluation of the proposed system.
4.1. System Configuration
The RL-TweetGen system was implemented and evaluated on a local high-performance workstation equipped with an NVIDIA Quadro P5000 GPU (16 GB VRAM), CUDA version 12.4, and Python 3.10.12. The system utilized PyTorch 2.4.1 for deep learning and model training, while integration with large language models (LLMs) was facilitated via Hugging Face Transformers v4.41.0. TensorFlow 2.8.0 supported auxiliary tasks. For natural language preprocessing and evaluation, the system employed NLTK, spaCy, scikit-learn, and SentenceTransformers. Engagement-prediction modeling was performed using XGBoost. Logging and visualization were managed with Weights and Biases (wandb), Matplotlib 3.9.2, and Seaborn 0.13.2. This software and hardware configuration enabled efficient tokenization, supervised fine-tuning, reinforcement learning optimization, and evaluation of tweet-generation quality and engagement potential.
4.2. Dataset Preparation
The dataset preparation process for NFT tweet generation was meticulously carried out through a series of steps including dataset selection, sampling, preprocessing, semantic annotation, engagement scoring, and partitioning. Each phase was designed to ensure the quality, representativeness, and relevance of data for training engagement-optimized text-generation models.
Dataset Source: The primary data source was the publicly available “Verified NFT Tweets” dataset from Kaggle [32], which comprises tweets from verified NFT-related accounts posted between 2020 and 2022. This dataset provides a diverse and temporally rich collection of domain-specific content relevant to NFT discourse across various communities.
Sampling Strategy: To curate a high-quality and balanced subset, stratified sampling was employed, guided by two key stratification variables: engagement level (Low, Medium, High) and semantic intent category. This resulted in a curated dataset of 1436 representative tweets, ensuring both engagement diversity and semantic coverage.
Preprocessing Steps: To prepare the dataset for downstream modeling tasks, a comprehensive preprocessing pipeline was applied:
Deduplication: Removed exact duplicate tweets.
Minimum Length Filter: Excluded tweets with fewer than 20 characters to eliminate noise.
Language Filtering: Retained only English tweets using the langid library for language detection.
Encoding Normalization: Converted from Latin-1 to UTF-8 to ensure compatibility.
Character Cleaning: Removed corrupted and non-ASCII characters using regular expressions.
Formatting Standardization: Normalized punctuation, whitespace, and special formatting artifacts.
This pipeline ensured clean, consistent input for both semantic and engagement modeling.
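As an illustration of this pipeline, the minimal sketch below applies the same steps with pandas; it assumes a DataFrame with a hypothetical "text" column, and uses the langid library named above (the exact regular expressions are stand-ins for the project's own rules):

```python
import re
import langid  # language identifier named in the preprocessing steps above
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pipeline over an assumed 'text' column."""
    df = df.drop_duplicates(subset="text")                # deduplication of exact duplicates
    df = df[df["text"].str.len() >= 20]                   # minimum-length filter (< 20 chars removed)
    df = df[df["text"].apply(lambda t: langid.classify(t)[0] == "en")]  # keep English tweets
    df["text"] = (
        df["text"]
        .str.encode("latin-1", errors="ignore")           # encoding normalization:
        .str.decode("utf-8", errors="ignore")             # Latin-1 bytes reinterpreted as UTF-8
        .apply(lambda t: re.sub(r"[^\x00-\x7F]+", " ", t))  # strip corrupted/non-ASCII characters
        .apply(lambda t: re.sub(r"\s+", " ", t).strip())    # normalize whitespace and formatting
    )
    return df
```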
Engagement Score Computation: Each tweet was assigned an Engagement Score, computed using a weighted formula that prioritizes deeper interactions by assigning higher importance to replies and retweets than to likes, following the scoring strategy outlined in Equation (1). Quantile-based binning was then applied to assign each tweet one of three discrete engagement classes: Low, Medium, or High. These engagement labels served as key supervision signals during reinforcement learning optimization.
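Equation (1) itself is defined earlier in the paper; the sketch below only illustrates the scoring-and-binning pattern. The weights 0.5/0.3/0.2 are placeholders chosen to reflect the reply > retweet > like ordering described above, not the paper's actual coefficients, and the interaction column names are assumed:

```python
import pandas as pd

# Placeholder weights mirroring the reply > retweet > like priority of Equation (1).
W_REPLY, W_RETWEET, W_LIKE = 0.5, 0.3, 0.2

def engagement_score(df: pd.DataFrame) -> pd.Series:
    return W_REPLY * df["replies"] + W_RETWEET * df["retweets"] + W_LIKE * df["likes"]

df["engagement_score"] = engagement_score(df)
# Quantile-based binning into three balanced engagement classes.
df["engagement_class"] = pd.qcut(df["engagement_score"], q=3,
                                 labels=["Low", "Medium", "High"])
```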
Semantic Intent Classification: Tweets were also categorized into one of six semantic intent classes, capturing their primary communicative goals: Art, Gaming, Music, Photography, Membership, and Profile Pictures (PFPs).
Annotation Methodology
Weak Supervision: Employed domain-specific keyword heuristics and regular expression rules to generate initial pseudo-labels.
Feature Engineering: Used TF-IDF vectorization including bigrams to capture relevant n-gram patterns.
Model Selection: Evaluated several classifiers; a Gradient Boosting Classifier was selected based on superior validation performance.
This two-stage process enabled high-quality semantic labeling with minimal manual effort.
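A condensed sketch of this two-stage labeling follows. The keyword rules shown are an illustrative subset (the project's heuristics are more extensive), and the `texts` variable is an assumed list of cleaned tweets:

```python
import re
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Stage 1: weak supervision via keyword/regex heuristics -> pseudo-labels.
RULES = {  # illustrative subset of the domain rules
    "Gaming": r"\b(game|play2earn|p2e)\b",
    "Music": r"\b(music|track|album)\b",
    "PFPs": r"\b(pfp|avatar)\b",
}

def pseudo_label(text: str, default: str = "Art") -> str:
    for label, pattern in RULES.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return label
    return default

pseudo = [pseudo_label(t) for t in texts]

# Stage 2: TF-IDF features (unigrams + bigrams) feed a Gradient Boosting Classifier.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    GradientBoostingClassifier(),
)
clf.fit(texts, pseudo)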
Dataset Partitioning: To facilitate effective training, fine-tuning, and evaluation of language models, the dataset was partitioned as follows: 80% Training, 10% Validation, 10% Testing. This stratified split preserved the class distribution across both engagement and semantic categories to avoid data imbalance and ensure fair evaluation.
4.3. Model Configuration
To evaluate the effectiveness of different strategies in optimizing NFT tweet generation for engagement, we implemented a multi-stage model configuration pipeline. This pipeline integrates baseline generation, supervised fine-tuning, stylistic and contextual diversification, engagement prediction, reinforcement learning, and decoding enhancements. Each stage is detailed below:
Baseline Zeroshot Tweet Generation:
As an initial benchmark, we assessed the generative capabilities of pretrained instruction-tuned large language models (LLMs) without any domain-specific adaptation or fine-tuning.
Models Evaluated: We employed three open-weight LLMs: LLaMA-3.1-8B Instruct, Mistral-7B Instruct v0.1, DeepSeek LLM 7B Chat.
Prompting Strategy: Prompts were derived from cleaned NFT tweets using templates that follow instruction-tuning conventions.
Model Inference Details: Generation was performed using the Hugging Face Transformers library with 4-bit quantization through BitsAndBytes to minimize memory overhead and enable multi-model experimentation.
Decoding Configuration: Temperature = 0.9, Top-p = 0.7, and Repetition penalty = 1.1.
Post-Processing: Outputs were truncated to 280 characters (Twitter constraint) and filtered for coherence and completeness.
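The zeroshot setup can be reproduced roughly as follows; this is a minimal sketch using the Hugging Face APIs and the decoding configuration above, shown for Mistral, with a simplified stand-in for the instruction-aligned prompt templates:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb, device_map="auto"
)

prompt = "[INST] Write a tweet announcing a new NFT art drop. [/INST]"  # simplified template
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=80,
    do_sample=True,
    temperature=0.9,        # decoding configuration from this section
    top_p=0.7,
    repetition_penalty=1.1,
)
tweet = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
tweet = tweet[:280]  # enforce Twitter's 280-character constraint
```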
Supervised Fine-Tuning with LoRA:
To adapt general-purpose LLMs to the NFT domain, we applied Low-Rank Adaptation (LoRA) using the PEFT (Parameter-Efficient Fine-Tuning) approach.
LoRA adapters were injected into the q_proj and v_proj layers in all models, with k_proj and o_proj also used in Mistral for greater expressiveness.
Fine-tuning parameters:
Epochs: 3, Batch size: 8–16, Sequence length: 64 tokens.
Objective: Causal Language Modeling (CLM) to train next-token prediction.
Inference used decoding settings: temperature = 0.7, top-p = 0.9, repetition penalty = 1.2.
Outputs were limited to 280 characters for platform compatibility.
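A minimal sketch of the adapter setup with the PEFT library is shown below, continuing from the base model loaded in the previous sketch. The rank, alpha, and dropout values are illustrative, since they are not reported above:

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,            # causal language modeling objective
    target_modules=["q_proj", "v_proj"],     # + ["k_proj", "o_proj"] for Mistral
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative hyperparameters (not reported above)
)
model = get_peft_model(model, lora_cfg)      # wraps the previously loaded base model
model.print_trainable_parameters()
# Training then runs for 3 epochs with batch sizes of 8-16 and 64-token sequences.
```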
LSC-Based Diverse Tweet Generation:
To enhance stylistic diversity and better target varied NFT audiences, the proposed system applied the LSC (Length, Style, Context) framework post fine-tuning.
The three LSC dimensions consisted of:
Length: short, medium, long;
Style: formal, informal, excited, casual, professional, friendly, humorous;
Context: art, investment, technology, community, gaming, collectibles, fashion, music.
For each LSC combination, custom prompts were generated using targeted hashtags, emojis, call-to-actions (CTAs), and domain-specific hooks.
Sampling temperatures were dynamically adjusted based on the style dimension. For instance, higher temperatures were used to support creativity in humorous tweets, whereas lower temperatures supported coherence in professional or formal styles.
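The sketch below illustrates how the LSC grid and style-conditioned temperatures can be enumerated. The prompt template and the specific temperature values are assumptions for illustration, not the project's exact settings:

```python
from itertools import product

LENGTHS = ["short", "medium", "long"]
STYLES = ["formal", "informal", "excited", "casual",
          "professional", "friendly", "humorous"]
CONTEXTS = ["art", "investment", "technology", "community",
            "gaming", "collectibles", "fashion", "music"]

# Illustrative style-conditioned temperatures: creative styles sample hotter.
STYLE_TEMPERATURE = {"humorous": 1.0, "excited": 0.9,
                     "professional": 0.6, "formal": 0.6}

def lsc_prompt(length: str, style: str, context: str) -> str:
    # Simplified stand-in for the hashtag/emoji/CTA-enriched templates.
    return (f"Write a {length}, {style} tweet about an NFT {context} project. "
            f"Include relevant hashtags and a call to action.")

for length, style, context in product(LENGTHS, STYLES, CONTEXTS):
    prompt = lsc_prompt(length, style, context)
    temperature = STYLE_TEMPERATURE.get(style, 0.7)  # default for unlisted styles
    # ... generate with the fine-tuned model at this temperature ...
```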
Engagement Scoring Model:
An XGBoost-based regression model was developed to predict engagement scores for generated tweets and serve as a reward signal.
Input features included: tweet length, word count, presence of emojis and hashtags, TF-IDF vectors, sentiment polarity scores, and part-of-speech tags.
Feature preprocessing was conducted using StandardScaler and ColumnTransformer to normalize numeric and text features appropriately.
The model was trained on historical NFT-related tweets with annotated engagement metrics (likes, retweets, replies).
The predicted engagement score was used as a core component of the reward function during reinforcement learning.
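A minimal sketch of this regressor follows, assuming a training DataFrame with hypothetical column names; POS-tag proportions are omitted for brevity, and the XGBoost hyperparameters are illustrative:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor

numeric = ["length", "word_count", "has_emoji", "has_hashtag", "sentiment"]  # assumed columns

features = ColumnTransformer([
    ("num", StandardScaler(), numeric),                      # normalize numeric features
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2)), "text"),  # text features from the tweet body
])
reward_model = Pipeline([
    ("features", features),
    ("xgb", XGBRegressor(n_estimators=300, max_depth=6)),    # illustrative hyperparameters
])
reward_model.fit(train_df, train_df["engagement_score"])
predicted = reward_model.predict(candidate_df)  # used as the RL reward signal
```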
Simulated Reinforcement Learning Environment:
A simulated tweet-generation environment was constructed to optimize output quality using a blended reward mechanism.
Fine-tuned models generated tweets conditioned on various LSC-driven prompt combinations.
The reward signal consisted of a weighted combination of the predicted engagement score (from the XGBoost model) and expert-curated human feedback scores.
A moving average reward baseline was maintained to stabilize policy updates.
The advantage term $A = R - \bar{R}$, i.e., the blended reward relative to the moving-average baseline, was used for reinforcement learning. Based on the sign of A, contextual temperature was adjusted:
$A > 0$: temperature was increased to promote exploration.
$A < 0$: temperature was reduced to improve consistency and precision.
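The loop below is a minimal sketch of this reward blending and advantage-driven temperature control. The reward weight, smoothing factor, step sizes, and the helpers `generate`, `predict_engagement`, and `expert_score` are all hypothetical placeholders for the components described above:

```python
# Minimal sketch of the blended reward and advantage-driven temperature control.
ALPHA = 0.7   # assumed weight on predicted engagement vs. expert feedback
BETA = 0.9    # assumed smoothing factor for the moving-average baseline
baseline, temperature = 0.0, 0.7

for prompt in prompts:  # LSC-driven prompt combinations
    tweet = generate(prompt, temperature=temperature)  # hypothetical generation helper
    reward = ALPHA * predict_engagement(tweet) + (1 - ALPHA) * expert_score(tweet)
    baseline = BETA * baseline + (1 - BETA) * reward   # moving-average reward baseline
    advantage = reward - baseline                      # A = R - R_bar
    if advantage > 0:
        temperature = min(1.2, temperature + 0.05)     # promote exploration
    else:
        temperature = max(0.5, temperature - 0.05)     # tighten for consistency
    # ... policy update weighted by the advantage ...
```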
Advanced Decoding Strategies:
To balance creativity and relevance in generation, the system implemented three decoding techniques:
Tailored Beam Search (TBS): Beam candidates were re-ranked using NFT-specific heuristics such as keyword alignment, semantic similarity to training data, and domain topic relevance (Equation (7)).
Enhanced Top-p Sampling: Tokens were selected from the minimal set whose cumulative probability exceeded a threshold p, enabling more diverse output while controlling randomness (Equation (8)).
Contextual Temperature Scaling (CTS): The temperature value for decoding was adapted based on thematic context (e.g., art vs. gaming), ensuring stylistic consistency (Equation (9)).
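As a rough illustration of how Equations (8) and (9) translate into generation arguments, the sketch below maps thematic context to a temperature and applies nucleus sampling; the context-to-temperature values are assumptions, and `model`, `tok`, and `inputs` reuse the earlier generation sketch (TBS re-ranking is omitted):

```python
# Contextual Temperature Scaling: map thematic context to a decoding temperature.
CONTEXT_TEMPERATURE = {"art": 0.9, "gaming": 0.8, "investment": 0.6}  # illustrative values

def decode_config(context: str) -> dict:
    return {
        "do_sample": True,
        "top_p": 0.9,                                         # nucleus threshold p, Equation (8)
        "temperature": CONTEXT_TEMPERATURE.get(context, 0.7), # CTS, Equation (9)
    }

out = model.generate(**inputs, max_new_tokens=80, **decode_config("art"))
print(tok.decode(out[0], skip_special_tokens=True)[:280])
```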
4.4. Evaluation and Feedback Loop
A comprehensive evaluation loop was designed using both automated and human-in-the-loop feedback. Human evaluators scored generated tweets across four axes: contextual relevance, engagement potential, clarity and coherence, and stylistic quality, using a 5-point Likert scale. These scores were normalized and reintegrated into the reward function. Additionally, model outputs were benchmarked by sampling tweets per model across three phases: zeroshot, supervised fine-tuning, and RL-optimized generation. Performance was assessed using engagement prediction, Likert-based feedback scores, and qualitative analysis, allowing for detailed comparison across models and stages.
4.5. Statistical and Reliability Analysis
To ensure the robustness and interpretability of our evaluation results, we incorporated the following statistical tests, variance reporting, and inter-rater reliability measures:
Assumptions and Variance Testing: Before conducting statistical comparisons between model outputs, we performed Levene’s test to assess homogeneity of variances. For normally distributed metrics, ANOVA was used; otherwise, non-parametric tests were employed.
Statistical Significance Testing: Pairwise model comparisons were conducted using the Wilcoxon signed-rank test (non-parametric) and validated through pairwise bootstrap resampling to assess the stability of observed differences, with significance assessed against a pre-specified threshold.
Error Bars, Standard Deviation, and Confidence Intervals: All reported metric means are accompanied by standard deviations. For plots, 95% confidence intervals were visualized using error bars to illustrate variance across multiple runs and prompt conditions.
Engagement Model Reporting: The XGBoost engagement predictor achieved an $R^2$ score of 0.823 and an RMSE of 0.041 (normalized scale) on the test set. Input features included tweet length, word count, emoji/hashtag presence, sentiment polarity, TF-IDF vectors, and POS-tag proportions. A log1p transformation was applied to likes, retweets, and replies to mitigate skewness.
Inter-Rater Reliability: Human evaluation scores from three annotators were tested for agreement using both Cohen’s κ (pairwise) and Krippendorff’s α (multi-rater). Results indicated substantial agreement on both measures.
Semantic Similarity Metrics: In addition to BLEU, ROUGE, and METEOR, we included BERTScore (F1) using a pre-trained all-MiniLM-L6-v2 model from SentenceTransformers to capture semantic alignment between generated and reference tweets.
5. Results
This section validates the effectiveness of the proposed RL-TweetGen system through both quantitative and qualitative analysis, demonstrating how each component contributes to performance gains and highlighting the impact of reinforcement learning on generating engagement-optimized tweets. It includes a comparative evaluation of tweet semantic classification, lexical and semantic performance metrics, and a detailed analysis across baseline and fine-tuned models. Key insights and interpretations are provided, with particular emphasis on the role of reinforcement learning in enhancing output quality and engagement relevance.
5.1. Comparative Analysis for Tweet Semantic Classification
To assess the effectiveness of the semantic classification pipeline, all models were trained and evaluated using stratified splits of the annotated dataset. The performance of each model was measured using accuracy, precision, recall, and F1-score, as summarized in Table 4 and visualized in Figure 6.
Among the classifiers, Gradient Boosting achieved the best overall results, with the highest validation accuracy, a precision of 0.76, recall of 0.76, and F1-score of 0.75, outperforming all other classifiers evaluated in this study, including Logistic Regression (61%), Linear SVM (68%), SGD Classifier (70%), Random Forest (66%), and Decision Tree (69%).
Gradient Boosting works by iteratively refining predictions through the stage-wise addition of weak learners. At each iteration $m$, it updates the model as shown in Equation (10):

$$F_m(x) = F_{m-1}(x) + \nu \, h_m(x)$$

where $F_{m-1}(x)$ is the ensemble prediction from the previous iteration, $h_m(x)$ is the weak learner trained on the residuals (i.e., the negative gradient of the loss function), and $\nu$ is the learning rate that controls the contribution of each learner.
This process allows the model to progressively correct errors, capture complex nonlinear relationships, and reduce both bias and variance. Its superior performance in capturing semantic patterns in NFT-related tweets validates its selection as the final classifier for generating structured labels in downstream tasks.
5.2. Lexical Performance Evaluation: Metrics and Results
Lexical metrics reveal how closely generated tweets match reference tweets at the surface level, focusing on n-gram precision, recall, and sequence similarity. We report BLEU, METEOR, ROUGE-1, ROUGE-2, and ROUGE-L. As shown in Table 5, the lexical performance of the models varied significantly across training types.
5.2.1. BLEU Score
The BLEU (Bilingual Evaluation Understudy) score computes modified precision over n-grams. A brevity penalty (BP) is applied to penalize overly short hypotheses. The BLEU score is computed as shown in Equation (11):

$$\text{BLEU} = \text{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

where $w_n$ is the weight assigned to the n-gram precision $p_n$, and BP is the brevity penalty.
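For reference, Equation (11) can be computed with NLTK (already part of the evaluation stack described in Section 4.1); the sentences below are made-up examples, and smoothing is added because short tweets often have zero higher-order n-gram matches:

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = "surprise nft collab drop with ugonzo art".split()
candidate = "surprise nft collab drop with two artists".split()

# Uniform weights w_n over 1- to 4-grams, as in Equation (11).
score = sentence_bleu(
    [reference], candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.4f}")
```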
For the BLEU scores, which reflect token-level precision of generated tweets against references, the Mistral model demonstrated clear and consistent improvement across all training stages. Starting with a BLEU score of 0.2144 at the base stage, it increased to 0.2240 after domain-specific fine-tuning, and further rose to 0.2285 following reinforcement learning optimization. This steady upward trend indicates that both fine-tuning and the engagement-driven RL reward were effective in aligning Mistral’s lexical choices with those found in authentic NFT tweets.
In contrast, the LLaMA model experienced only a modest improvement: it began with a BLEU score of 0.2058 at the base stage, climbed to 0.2110 after fine-tuning, but then slightly declined to 0.2075 post-RL. This suggests that while LLaMA benefited somewhat from exposure to NFT tweet data during fine-tuning, the RL stage did not consistently reinforce lexical alignment, possibly due to increased generation diversity aimed at engagement.
Meanwhile, the DeepSeek model exhibited minimal variation in BLEU scores during training, consistently ranging between 0.1890 and 0.1900. This stagnation suggests a limited ability to capture or adapt to the nuanced lexical patterns characteristic of NFT-related tweets, potentially due to architectural constraints or less domain-aligned pretraining data.
5.2.2. METEOR Score
METEOR (Metric for Evaluation of Translation with Explicit ORdering) improves upon BLEU by combining unigram precision and recall while introducing an alignment penalty for fragmented matches. The score is computed using Equation (12):

$$\text{METEOR} = F_{\text{mean}} \cdot (1 - P)$$

where $F_{\text{mean}}$ is the harmonic mean of unigram precision and recall, and $P$ is the fragmentation penalty based on the number of chunks in the alignment.
The METEOR scores, which evaluate phrase-level alignment through a combination of precision, recall, and semantic similarity, revealed contrasting dynamics across the three models. Mistral began with a METEOR of 0.3120 in the zeroshot baseline and showed a modest improvement to 0.3164 after supervised fine-tuning, demonstrating its ability to better capture paraphrastic variations and synonymic patterns common in NFT discourse. However, after reinforcement learning, its METEOR score dipped slightly to 0.3110, suggesting that while the RL phase optimized for engagement, it may have introduced more creative but less directly reference-aligned phrasing, slightly sacrificing phrase-level precision for diversity.
Conversely, LLaMA exhibited a gradual rise, with its METEOR peaking at 0.3169 after RL fine-tuning—higher than both its base and fine-tuned-only scores—indicating that LLaMA benefited from the RL stage in terms of learning richer and more flexible phrase patterns that better matched NFT tweet references.
DeepSeek, however, exhibited comparatively lower METEOR scores, never surpassing 0.2399 at any stage. This consistent lag suggests it struggled to capture the syntactic and phrasal patterns typical of NFT marketing tweets, highlighting its limited adaptability to the nuanced linguistic demands of this specialized domain.
5.2.3. ROUGE Metrics
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) evaluates the quality of generated text by measuring overlaps with reference text. ROUGE-1, ROUGE-2, and ROUGE-L assess unigram, bigram, and longest common subsequence (LCS) similarity, respectively, as shown in Equations (13)–(15):

$$\text{ROUGE-N} = \frac{\sum_{\text{gram}_n \in R} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{\text{gram}_n \in R} \text{Count}(\text{gram}_n)}, \quad n \in \{1, 2\}$$

$$\text{ROUGE-L} = \frac{2 \, P_{\text{lcs}} R_{\text{lcs}}}{P_{\text{lcs}} + R_{\text{lcs}}}$$

where $R$ is the reference tweet, $\text{Count}_{\text{match}}$ counts n-grams shared with the candidate, and $P_{\text{lcs}}$ and $R_{\text{lcs}}$ are the LCS-based precision and recall.
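One common way to compute these metrics is the `rouge-score` package (the paper does not name its implementation, so this is an assumption); the sentences are the same made-up examples used for BLEU above:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "surprise nft collab drop with ugonzo art",    # reference
    "surprise nft collab drop with two artists",   # candidate
)
for name, result in scores.items():
    print(f"{name}: F1 = {result.fmeasure:.4f}")
```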
Among the three models, Mistral demonstrated the most consistent and substantial improvements across all ROUGE metrics through both fine-tuning and reinforcement learning. ROUGE-1 increased steadily from 0.3362 (base) to 0.3484 (fine-tuned), and further to 0.3542 after RL. ROUGE-2 showed a similar trend, improving from 0.2178 to 0.2409 with fine-tuning, followed by a minor decline to 0.2357 post-RL. ROUGE-L also rose from 0.3137 (base) to 0.3366 (fine-tuned), and peaked at 0.3413 with RL. These gains indicate that Mistral became increasingly effective at reproducing not just key lexical items, but also the structural patterns characteristic of NFT-related social media content.
In contrast, LLaMA demonstrated particular strength in semantic fluency and phrase-level diversity. Its ROUGE-2 score improved from 0.2350 to 0.2521 after fine-tuning and further increased to 0.2588 following RL—representing the highest bigram overlap among all models. However, ROUGE-1 and ROUGE-L showed a marginal decline post-RL, decreasing from 0.3368 to 0.3305 and from 0.3315 to 0.3281, respectively. This suggests that while RL enhanced LLaMA’s expressive variation, it slightly reduced its ability to reproduce exact lexical sequences and structural matches.
DeepSeek, in contrast, consistently delivered weaker performance across all evaluation stages. Although fine-tuning led to modest improvements—such as an increase in ROUGE-1 from 0.2422 to 0.2471 and ROUGE-2 from 0.2008 to 0.2118—the reinforcement learning phase did not yield significant gains. In fact, some metrics slightly declined, with ROUGE-2 dropping to 0.2074 and ROUGE-L to 0.2341. These findings indicate DeepSeek’s limited effectiveness in capturing domain-specific lexical patterns and reproducing the stylistic and structural nuances characteristic of NFT-related tweets.
Collectively, the ROUGE-based analysis highlights the distinct strengths of each model. Mistral excels in lexical fidelity and structural fluency, making it ideal for precise, high-clarity tweet generation. LLaMA offers superior performance in phrase-level diversity and semantic fluidity, positioning it well for stylistically varied and engaging content.
5.3. Semantic Performance Evaluation: Metrics and Results
Lexical scores alone cannot guarantee preserved meaning. We therefore computed BERT-based metrics and cosine similarity to evaluate whether generated tweets capture the semantics of the references. As shown in Table 6, BERT-based and cosine similarity scores confirm that semantic consistency improves after fine-tuning.
5.3.1. BERTScore
BERTScore measures semantic similarity between a generated tweet and its reference by computing cosine similarity between contextualized BERT embeddings.
Let $C$ and $R$ be the token sets from the candidate and reference, respectively. The score is:

$$P_{\text{BERT}} = \frac{1}{|C|} \sum_{c \in C} \max_{r \in R} \cos\big(\mathbf{E}(c), \mathbf{E}(r)\big), \qquad R_{\text{BERT}} = \frac{1}{|R|} \sum_{r \in R} \max_{c \in C} \cos\big(\mathbf{E}(c), \mathbf{E}(r)\big)$$

$$F_{\text{BERT}} = \frac{2 \, P_{\text{BERT}} R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$$

where $\mathbf{E}(\cdot)$ denotes BERT embeddings and cos is cosine similarity.
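The sketch below is a minimal greedy-matching implementation of this formulation using token embeddings from the all-MiniLM-L6-v2 encoder mentioned in Section 4.5; note that the official bert-score package differs in details such as IDF weighting and baseline rescaling:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def bertscore_f1(candidate: str, reference: str) -> float:
    # Token-level contextual embeddings for each sentence.
    c = encoder.encode(candidate, output_value="token_embeddings")
    r = encoder.encode(reference, output_value="token_embeddings")
    sim = util.cos_sim(c, r)                  # pairwise cosine similarities [|C| x |R|]
    precision = sim.max(dim=1).values.mean()  # best reference match per candidate token
    recall = sim.max(dim=0).values.mean()     # best candidate match per reference token
    return float(2 * precision * recall / (precision + recall))
```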
Semantic analysis based on BERTScore F1 revealed a consistent improvement in Mistral’s ability to generate tweets that not only match target vocabulary but also capture the underlying meaning and conceptual alignment with reference NFT tweets. Specifically, Mistral’s BERT_F1 increased from 0.8240 in the zeroshot baseline to 0.8323 following supervised fine-tuning, culminating at 0.8461 after reinforcement learning. This steady upward trend indicates that each stage of adaptation enabled Mistral to produce outputs with higher semantic fidelity—capturing nuanced references, context-specific terms, and implied meanings critical in NFT marketing discourse.
Conversely, LLaMA exhibited a smaller improvement trajectory, with BERT_F1 peaking at 0.8155 post-RL. While this suggests some gains in phrase-level understanding, it consistently lagged behind Mistral, highlighting LLaMA’s relative difficulty in grasping nuanced NFT-specific semantics even after fine-tuning and RL optimization.
Meanwhile, DeepSeek consistently exhibited lower performance, with BERT_F1 scores plateauing below 0.7907 across all stages. This persistent limitation suggests challenges in semantically adapting to the distinctive conceptual landscape of NFT-related tweets, likely due to constraints in its pretrained knowledge or architectural alignment with the domain. Overall, these findings underscore that while all models benefitted from fine-tuning and reinforcement learning, Mistral demonstrated a markedly superior ability to generate content attuned to the evolving linguistic and thematic characteristics of the NFT discourse.
5.3.2. Cosine Similarity
Cosine similarity was computed between the mean-pooled BERT embeddings of the generated and reference tweets to measure their semantic alignment:

$$\text{CosSim}(\mathbf{g}, \mathbf{r}) = \frac{\mathbf{g} \cdot \mathbf{r}}{\lVert \mathbf{g} \rVert \, \lVert \mathbf{r} \rVert}$$

where $\mathbf{g}$ and $\mathbf{r}$ are the averaged embedding vectors of the generated and reference tweets, respectively.
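A two-line sketch of this measure, again with the all-MiniLM-L6-v2 encoder (the example tweets are invented):

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
# encode() mean-pools token embeddings into a single sentence vector by default.
gen_vec = encoder.encode("Fresh NFT drop tonight! Join the mint!", convert_to_tensor=True)
ref_vec = encoder.encode("Surprise NFT collab drop with @Ugonzo_art", convert_to_tensor=True)
similarity = float(util.cos_sim(gen_vec, ref_vec))
```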
LLaMA exhibited a significant and consistent rise in cosine similarity, improving from 0.6860 in the baseline to 0.7173 after reinforcement learning. This notable increase suggests that LLaMA’s tweets evolved to better capture the stylistic and topical diversity inherent in NFT discourse, aligning more closely with the conceptual space of authentic NFT-related content. The growth in LLaMA’s cosine scores implies that reinforcement learning effectively guided the model toward broader, more nuanced semantic representations consistent with varied NFT topics, from digital art to gaming collectibles.
In contrast, Mistral experienced a slight decline in cosine similarity during the reinforcement learning stage, dropping from 0.5023 to 0.4861. This suggests that although RL enhanced lexical creativity and phrase diversity, it may have occasionally driven the model to explore expressions deviating from precise semantic alignment, favoring novelty over exact meaning.
Meanwhile, DeepSeek’s cosine similarity remained consistently low, never surpassing 0.4735, highlighting its difficulty in generating tweets that semantically align with reference NFT content. These results illustrate a trade-off between creativity and semantic fidelity: LLaMA’s improvements reflect effective adaptation to NFT stylistics, while Mistral’s slight decline suggests that reinforcement learning-driven exploration may occasionally detract from strict semantic consistency. DeepSeek’s consistently flat performance highlights its limited capacity to grasp and reproduce the semantic nuances specific to NFT discourse.
5.4. Performance Analysis and Comparative Evaluation
The collective results from both lexical and semantic evaluations reveal distinct patterns in model adaptation and performance:
Supervised fine-tuning consistently improved lexical (BLEU, METEOR, ROUGE) and semantic (BERTScore, cosine similarity) metrics across models. This underscores the value of domain-specific training: exposure to authentic NFT tweets significantly enhanced each model’s ability to generate accurate and contextually relevant outputs.
Reinforcement learning (RL) further optimized lexical precision, most notably for Mistral, which demonstrated continued gains in BLEU and ROUGE scores. However, RL introduced a nuanced trade-off: while lexical diversity and phrase-level accuracy improved, semantic fidelity occasionally declined, as seen in the slight drop in Mistral’s cosine similarity score post-RL. This suggests that RL’s exploratory behavior may at times favor creative variation over strict semantic alignment.
In comparative performance, Mistral emerged as the most precise model at the token level, achieving the highest BLEU, ROUGE, and BERTScore F1 scores. This makes it particularly effective for generating tweets that adhere closely to expected lexical patterns. In contrast, LLaMA excelled in semantic preservation and expressive diversity, attaining the highest COSINE similarity scores. This indicates LLaMA’s superior capacity to maintain or creatively reinterpret the intended meaning of NFT-related content, making it better suited for use cases requiring stylistic variety and user engagement.
DeepSeek exhibited modest gains after fine-tuning and reinforcement learning; nevertheless, it persistently trailed behind in both lexical and semantic metrics, highlighting its limited adaptability to domain-specific language and its constrained capacity to capture the expressive and stylistic subtleties of NFT-related discourse.
Overall, these findings highlight that while both fine-tuning and RL improve tweet-generation quality, model selection should be guided by the specific marketing objective—whether the goal is precision and keyword fidelity (Mistral) or creative expressiveness and engagement (LLaMA).
6. Discussion
This section analyzes the implications of our approach, including the effectiveness of reinforcement learning (RL) for optimizing tweet generation in the NFT domain, the challenges of generalizing to broader contexts, and the ethical considerations of engagement-based optimization. Key findings are interpreted in the context of generative AI trends and marketing automation.
6.1. Interpretation of Key Findings
The evaluation reveals a lexical–semantic trade-off introduced by reinforcement learning with engagement-focused rewards. While RL consistently improves surface-level metrics such as BLEU and ROUGE, it can reduce semantic alignment scores, including cosine similarity. This reflects a systematic bias toward generating more creative and attention-grabbing tweets at the expense of strict meaning preservation—a trade-off that may be advantageous in NFT communities, where novelty and expressiveness often outweigh factual precision.
Model specialization was evident. Mistral achieved higher lexical precision, making it well-suited for keyword-driven and consistent messaging. LLaMA excelled at maintaining thematic coherence while introducing stylistic variation, producing semantically consistent yet diverse expressions—valuable for maintaining engagement in dynamic social media contexts. DeepSeek showed modest improvements but struggled with adaptation and expressiveness, highlighting the influence of base model capacity.
Integrating engagement-predictive signals into the reward function—combining model-predicted and human-rated engagement—systematically biased outputs toward stylistic and thematic norms prevalent in NFT communities. This demonstrates how RL can align technical capabilities with socio-cultural expectations, improving relevance and impact in niche online ecosystems.
Overall, supervised fine-tuning improved both lexical and semantic quality, while RL optimization enhanced fluency and high-engagement linguistic patterns. The trade-off between novelty and strict factual alignment underscores the need for careful reward function design.
6.2. Generalizability and Domain Transfer
RL-TweetGen’s combination of domain-specific fine-tuning with engagement-oriented RL shows strong potential for adaptation beyond NFTs. A comprehensive lexical–semantic evaluation suite—incorporating BLEU, METEOR, ROUGE (ROUGE-1, ROUGE-2, ROUGE-L), BERT-based precision, recall, and F1, and cosine similarity—captures both surface-level accuracy and deeper semantic fidelity, supporting transfer to other short-form, engagement-driven domains such as promotional posts, viral memes, and news headlines.
The divergence in model performance reinforces the importance of strategic model selection: Mistral is preferable for precision and keyword fidelity, while LLaMA is better suited for creativity and stylistic diversity. Matching generation strategies to communicative goals and cultural contexts is essential for optimal performance.
While the evaluation framework and modular RL reward structure are domain-agnostic, NFT-specific fine-tuning embeds community-specific patterns (e.g., slang, token mentions) that reduce out-of-domain performance without retraining. Generalization tests revealed reduced coherence and relevance for unfamiliar prompts, indicating the need for domain-adaptive pretraining or modular decoder design to enable broader transfer.
6.3. Ethical Implications and Societal Impact
Optimizing purely for engagement introduces ethical risks, including bias reinforcement, clickbait amplification, and manipulative content patterns—particularly in hype-driven markets like NFTs. Predictive engagement models may favor historically popular styles, potentially marginalizing minority perspectives or emerging creative forms.
Domain overfitting is another concern: models tuned exclusively for NFT discourse may propagate niche jargon, speculative narratives, or investment hype, limiting transferability and potentially influencing volatile market behaviors. Ethical deployment requires bias audits, transparency, and mechanisms to maintain authenticity and trust.
Specific risks associated with RL-TweetGen include the reinforcement of bias, where training on NFT tweets may replicate sensationalism, exclusionary jargon, or stereotypes; the promotion of clickbait and manipulative content, as engagement-based rewards could prioritize virality over factual accuracy or ethical responsibility; and social manipulation, where in speculative markets, automated tweets may influence sentiment, discourse, or investment decisions. To address these concerns, mitigation strategies have been implemented, including bias detection and toxicity filtering, human-in-the-loop oversight during deployment, the use of multi-objective reinforcement learning to balance engagement, informativeness, and ethical considerations, and counter-bias sampling during both training and inference stages.
6.4. Limitations and Future Work
The current system is limited to text-only generation, omitting the rich multimedia elements central to NFT promotion. As with all data-driven models, it may inherit latent biases from training data, influencing tone, diversity, or framing. Reward sparsity, short-text credit assignment, and exploration costs remain technical challenges.
Additional limitations of this study include its focus solely on NFTs, which constrains broader applicability, and the reliance on engagement labels derived from historical metrics, potentially introducing bias or noise. The evaluation scope was limited by the absence of manual assessments and live A/B testing, restricting insights into perceived quality and user reception. Furthermore, the system has not undergone real-time deployment to measure downstream engagement effects, and uncertainty quantification—such as confidence intervals and error estimates—was not incorporated.
Future work will address these gaps by extending RL-TweetGen to multi-modal generation incorporating images, metadata, and sentiment trends, and by implementing multi-domain fine-tuning with task-switching capabilities. Additional enhancements will include incorporating reinforcement learning with human feedback (RLHF) for nuanced preference alignment, integrating bias-aware controls and stylistic modulation, introducing uncertainty estimation and variance analysis, and conducting user studies alongside real-time deployments to evaluate trust, impact, and overall acceptability.
7. Test Cases
This section presents representative test cases that illustrate the performance progression of the RL-TweetGen system. For each test case, the original tweet is presented first, followed by the tweet outputs generated at three different stages. The zeroshot generation is produced by the base pretrained model without any NFT-specific adaptation. The fine-tuned output is generated using a language model trained on NFT-related content to improve domain relevance and stylistic fluency. The RL-enhanced output is refined using reinforcement learning techniques to further optimize for NFT audience engagement, clarity, and actionability.
Each model’s output is accompanied by expert observations from an NFT marketing perspective, assessing: semantic alignment with NFT terminology and culture, clarity and emotional resonance for collectors and enthusiasts, the strength of the call-to-action, and overall promotional impact within the Web3 and NFT community.
Ground Truth Tweet for James Jean Black Friday NFT Drop:
$500 Dollars Off! #BlackFriday Deal! https://t.co/efCdhBXAfL #art #painting #jamesjean #nft #holidaygiftguide #giftguide #deals
Ground Truth Tweet for Ugonzo x Stephen NFT Collab Drop: Surprise NFT collab drop with @Ugonzo_art x @Stephen_Animate #nft #nftcommunity #nftcollectors #pinknfts
As illustrated in Table 7 and Table 8, the RL-enhanced model demonstrates a significant improvement in emotional appeal and seasonal relevance for the James Jean Black Friday NFT drop, while also showcasing enhanced narrative clarity and call-to-action strength in the Ugonzo x Stephen collaborative release.
8. Conclusions
This study introduced RL-TweetGen, a socio-technical framework for engagement-optimized short text generation, combining generative large language models (LLMs), semantic classification, and reinforcement learning (RL). Tailored specifically to the context of NFT communications on Platform X (Twitter), RL-TweetGen bridges the gap between automated content generation and audience-specific engagement needs. The system leverages state-of-the-art instruction-tuned models—LLaMA-3.1-8B, Mistral-7B Instruct, and DeepSeek LLM 7B Chat—to produce diverse, context-aware, and stylistically adaptive tweets. Unlike prior efforts that primarily addressed generic tweet generation, RL-TweetGen contributes a domain-specialized RL framework that integrates three core innovations: LSC Variation—a structured method for controlling tweet Length, Style, and Context, enabling high diversity and alignment with audience expectations; Domain-Tuned Generative Modeling, utilizing Parameter-Efficient Fine-Tuning (PEFT via LoRA) to adapt LLMs to NFT-specific language and themes; and Engagement-Aware Reinforcement Learning, which uses a blended reward function (XGBoost-predicted engagement scores and human feedback) to fine-tune tweet outputs for higher interaction potential.
Empirical evaluation demonstrates that RL-TweetGen consistently outperforms baseline zeroshot generation models across both lexical (BLEU, METEOR, ROUGE) and semantic (BERTScore, Cosine Similarity) metrics. It shows practical utility for NFT marketers, creators, and Web3 platforms seeking scalable, data-driven, and engagement-optimized communication strategies.