Article

Effective Multi-Class Sentiment Analysis Using Fine-Tuned Large Language Model with KNIME Analytics Platform

by Jin-Ching Shen 1, Nai-Jing Su 2 and Yi-Bing Lin 1,*

1 College of Artificial Intelligence, National Yang Ming Chiao Tung University, Hsinchu City 300093, Taiwan
2 Department of Project Management and Industrial Engineering, Shandong University, Jinan 250100, China
* Author to whom correspondence should be addressed.
Systems 2025, 13(7), 523; https://doi.org/10.3390/systems13070523
Submission received: 29 April 2025 / Revised: 24 June 2025 / Accepted: 25 June 2025 / Published: 30 June 2025

Abstract

The rapid advancement of large language models (LLMs) has revolutionized natural language processing (NLP), yet fine-tuning these models for domain-specific applications remains a resource-intensive challenge. Odds ratio preference optimization (ORPO) is a novel fine-tuning methodology that unifies supervised fine-tuning (SFT) and alignment into a single optimization objective. By circumventing the traditional multi-stage pipeline of base model → SFT → reinforcement learning from human feedback (RLHF), ORPO achieves significant reductions in computational complexity while enhancing performance. We demonstrate the efficacy of ORPO through its application to multi-class sentiment analysis, a critical task in sentiment modeling with diverse and nuanced label sets. Using the KNIME analytics platform as an accessible, no-code interface, our approach streamlines model development and deployment, making an advanced sentiment analysis tool more usable and cost-effective for enterprises. Experimental results reveal that the ORPO-tuned LLM achieves high accuracy on a classic, publicly available airline dataset, outperforming traditional fine-tuning and NLP methods in both accuracy and efficiency. This work highlights the transformative potential of ORPO in simplifying fine-tuning and enabling scalable solutions for sentiment analysis and beyond. By integrating ORPO with KNIME, this work showcases the synergy between innovative methodologies and user-friendly platforms, advancing AI accessibility. The contributions focus on enhancing neutral sentiment analysis, developing an accessible KNIME LLM-based sentiment analysis system (KLSAS), and providing key resources for easy implementation, all of which promote the practical use and wider adoption of AI in both research and industry.

1. Introduction

The emergence of large language models (LLMs) has fundamentally transformed the landscape of natural language processing (NLP), enabling unprecedented capabilities in text understanding and generation [1]. These models have demonstrated remarkable success across various domains [2], from content generation to complex linguistic analysis [3]. However, their general-purpose nature often limits their effectiveness in specialized domains like the healthcare, finance, or legal sectors, where precise domain knowledge and terminological accuracy are critical [4]. For instance, in multi-class sentiment analysis within the healthcare sector, accurately interpreting nuanced patient feedback, ranging from satisfaction and neutrality to dissatisfaction and distress, requires a depth of understanding that generic models may not possess [5].
Research on domain-specific pretraining for biomedical natural language processing indicates that models trained exclusively on in-domain text outperform those that incorporate general-domain data [5]. Moreover, alignment techniques that integrate expert feedback are crucial to ensure that these models produce contextually appropriate and ethically sound outputs [6]. Fine-tuning and alignment are essential to bridge the gap between the broad capabilities of LLMs and the intricate requirements of domain-specific applications, such as multi-class sentiment analysis.
Traditional approaches to fine-tuning LLMs typically follow a multi-stage pipeline: starting with a base model, proceeding through supervised fine-tuning (SFT), and culminating in reinforcement learning from human feedback (RLHF) [6]. While effective, this process is resource-intensive and often prohibitively expensive for many organizations, particularly in industrial applications where computational efficiency is crucial. The complexity of this pipeline becomes even more pronounced in multi-class sentiment analysis tasks, where models must distinguish between multiple nuanced emotional states and opinions [7].
The industrial application of sentiment analysis has become increasingly crucial for business intelligence, customer experience management, and market analysis. Nevertheless, existing solutions often struggle with the complexity of real-world sentiment expressions, particularly in multi-class scenarios where traditional binary positive/negative classifications are insufficient [8]. Meanwhile, the industrial deployment of artificial intelligence (AI) solutions, particularly LLMs, faces three critical challenges: (1) the complexity of model integration [9,10]; (2) the need for scalable infrastructure [9,11]; and (3) the requirement for accessible user interfaces [9,12]. Traditional implementation approaches often necessitate extensive programming expertise and sophisticated DevOps knowledge, creating significant barriers to adoption for many organizations.
To address these challenges, we adopt odds ratio preference optimization (ORPO), a novel methodology that fundamentally reimagines the LLM fine-tuning process [13]. ORPO achieves two critical innovations. First, it unifies the traditionally separate stages of SFT and alignment into a single optimization objective, significantly reducing computational overhead while maintaining or improving performance. Second, it introduces a more efficient approach to preference learning that leverages odds ratio calculations, enabling more nuanced sentiment detection across multiple classes.
While several graphical platforms exist, most require custom scripting or lack built-in support for advanced alignment methods. To our knowledge, no solution integrates a preference-optimized fine-tuning loop (ORPO) directly into a low-code environment. By embedding ORPO prompt generation and model invocation into KNIME’s drag-and-drop workflow, we offer a genuinely novel interface that bridges cutting-edge fine-tuning with enterprise-ready usability. KNIME’s no-code environment provides an accessible interface for implementing complex NLP workflows [14], making advanced sentiment analysis tools available to a broader range of users [15]. This combination of algorithmic innovation and practical usability addresses a critical gap in the current landscape of industrial AI applications. KNIME reduces coding effort, enabling users with basic ML familiarity to deploy the ORPO-fine-tuned model trained in Google Colab. Our KNIME integration is a step toward accessibility but is not a complete solution for non-technical users.
In this paper, we fine-tune a large language model (LLM) for neutral sentiment analysis, integrate it into the user-friendly KNIME platform to develop the KNIME LLM-based Sentiment Analysis System (KLSAS), and provide open-source code, the fine-tuned LLM, and deployment steps for easy enterprise implementation. Our contributions enhance neutral sentiment analysis, create an accessible KLSAS system, and offer key resources that support the broader adoption of AI in research and industry.

2. Related Work

2.1. Comparative Analysis of Sentiment Analysis Research

This section synthesizes findings from prior research on the airline Twitter dataset (Tweets), revealing that most studies focus on employing machine learning (ML) and deep learning (DL) algorithms, including LSTM, to pursue higher performance metrics. However, these studies remain confined to static and sequential modeling stages, lacking a deeper understanding of semantics. A significant advancement was made by Hasib et al. [16] and Patel & Agrawal [17], who introduced BERT-based comparisons, elevating sentiment analysis to a stage of semantic comprehension, which we consider a landmark development.
Sentiment analysis is the computational process of determining the sentiment expressed in a given text, categorizing it as positive, negative, or neutral. Positive sentiment reflects favorable opinions, emotions, or attitudes often associated with words like “excellent,” “happy,” or “successful.” In contrast, negative sentiment indicates dissatisfaction, criticism, or unfavorable emotions, including terms like “poor,” “unhappy,” or “failure.” Neutral sentiment represents statements that lack strong emotional polarity, often conveying objective information without positive or negative connotations. With advancements in NLP, particularly through models like BERT, sentiment analysis has progressed beyond basic polarity classification to achieve a deeper understanding of contextual semantics.
Our study focuses on enhancing the recognition capability of neutral sentiments. Therefore, we compare our research with studies that provide neutral recall metrics [17,18,19] and arrange them according to neutral sentiment recall, as shown in Table 1. All reported improvements over the best baseline are statistically significant via a paired t-test (p < 0.01). Values in bold denote the top performance for each metric.
In the study by Rustam et al. [18], using the TF-IDF feature extraction method, the best neutral recall was between 74% and 77% for models like VC (LR+SGD), LR, and SGD. When switching to TF for feature extraction, ETC achieved the highest at 75%. Umer et al. [19] found that, using TF-IDF, the best neutral recall was 65% for VC (LR+SGD) and SGD. However, when using word2vec for feature extraction, these figures dropped to 49% and 48%, with LR and SVM falling even lower, to 0%. Patel & Agrawal [17] reported that, while the BERT model achieved an overall accuracy of 83%, its neutral recall was only 46%.
Although many baseline methods use simpler text embeddings (e.g., TF-IDF + LSTM), these approaches lack integrated alignment objectives and, thus, underperform on nuanced neutrality detection. ORPO’s odds-ratio loss simultaneously penalizes incorrect classes and rewards preferred labels, enabling KLSAS to achieve statistically significant gains in stability (±1% variance) and neutral accuracy (+11%) that go beyond mere architectural complexity. This demonstrates that the KLSAS model not only has the best recognition capability for neutral sentiments but also excels across all three sentiment categories. This performance is attributed to the unique architecture and feature extraction methods of the transformer, as well as our fine-tuning for neutral sentiments. KLSAS is, therefore, a sentiment analysis model with exceptional recognition capabilities across various sentiments, making it highly adaptable to real-world application scenarios.
While transformer-based embeddings serve as our primary feature extractor—owing to their superior contextualization capabilities—we also acknowledge classical alternatives. For instance, Rustam et al. [18] report a neutral recall of 74–77% using TF-IDF, and Umer et al. [19] observe drops to 48–65% when employing word2vec. These comparisons reinforce our choice of transformer embeddings for optimal neutral sentiment detection.

2.2. Large Language Models and Fine-Tuning

LLMs have emerged as transformative tools in NLP, demonstrating remarkable capabilities across various linguistic tasks [20]. The evolution of LLMs began with the breakthrough of the transformer architecture [21], leading to increasingly powerful models such as GPT-3 [1] and more recent innovations like LLaMA [22]. These models leverage massive-scale pre-training on diverse text corpora [23], enabling them to capture complex linguistic patterns and semantic relationships [24].
Despite their impressive capabilities, the adaptation of LLMs for specific downstream tasks remains challenging. Traditional fine-tuning approaches typically involve updating all or a subset of the model’s parameters using task-specific data [25]. This process, while effective, often requires substantial computational resources and large amounts of labeled data. Parameter-efficient fine-tuning (PEFT) methods have emerged as a response to these challenges, introducing techniques such as prompt tuning, adapter layers, and low-rank adaptation [26].
The conventional LLM fine-tuning pipeline consists of multiple stages. Initially, models undergo SFT using human-annotated data to adapt to specific tasks. While SFT is instrumental in tailoring LLMs to specific domains, it presents a notable challenge: the simultaneous elevation in the likelihood of producing both preferred and undesired responses [13]. This phenomenon necessitates an additional preference alignment stage to distinctly amplify the probability of preferred outputs while suppressing undesired ones, as shown in Figure 1.
SFT is followed by RLHF, which aligns the model’s outputs with human preferences [6]. However, this multi-stage approach presents several challenges: high computational costs, complex training procedures, and potential instability during the reinforcement learning phase [27].
Recent advancements in LLM training have sought to streamline the optimization pipeline while preserving or enhancing performance. One notable development is direct preference optimization (DPO), which offers a simplified alternative to RLHF. DPO directly optimizes the policy model by utilizing a reward model trained on human preference data, circumventing the complexity of traditional RLHF methods [28].
Building on this foundation, Kahneman–Tversky optimization (KTO) [29] introduces a novel alignment strategy based on human-aware loss optimization (HALO). KTO addresses the limitations of DPO by maximizing the utility of generated outputs directly from a binary desirability signal, rather than focusing on maximizing preference likelihoods. Despite these innovations, the conventional segregation of the initial fine-tuning and preference alignment phases persists. This division can lead to inefficiencies and hinder the overall optimization process, resulting in suboptimal outcomes for LLM training.
Hong et al. [13] introduced ORPO to streamline LLM training by unifying instruction tuning and preference alignment into a single process. ORPO enhances the standard language-modeling objective with a loss function combining negative log likelihood (NLL) and an odds ratio (OR) term. This approach weakly penalizes rejected responses while strongly rewarding preferred ones, enabling simultaneous task learning and alignment with human preferences. By merging task learning and alignment, ORPO streamlines the training pipeline, offering both efficiency and improved alignment outcomes.

Head-to-Head Technical Comparison of DPO, KTO, and ORPO

To quantify ORPO’s technical benefits over DPO and KTO, we present a direct comparison of their core training objectives and computational requirements.
DPO optimizes:
$$\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{x,\,y^{+},\,y^{-}}\left[ \log \sigma\left( S_{\phi}(y^{+}, x) - S_{\phi}(y^{-}, x) \right) \right] + \alpha \cdot \mathrm{NLL}(y^{+}, x)$$
where $y^{+}$ and $y^{-}$ are paired preferred/non-preferred responses, $S_{\phi}$ is a reward model, and $\alpha$ balances the preference loss against the task loss. DPO requires training the separate reward model $S_{\phi}$ alongside fine-tuning. $\mathrm{NLL}(y^{+}, x) = -\log P_{\theta}(y^{+} \mid x)$ is the SFT loss.
KTO leverages prospect-theoretic utility:
$$\mathcal{L}_{\mathrm{KTO}} = -\,\mathbb{E}\left[ u\left( p_{\theta}(y^{+} \mid x) \right) - u\left( p_{\theta}(y^{-} \mid x) \right) \right] + \beta \cdot \mathrm{NLL}(y^{+}, x)$$
where $u(\cdot)$ is a non-linear value function that re-weights probabilities, and $\beta$ is a scaling factor. KTO’s complexity arises from selecting and calibrating $u(\cdot)$.
ORPO directly integrates preference and task into one loss:
$$\mathcal{L}_{\mathrm{ORPO}} = \mathrm{NLL}(y_{\mathrm{acc}}, x) - \lambda \log \frac{p_{\theta}(y_{\mathrm{acc}} \mid x)}{p_{\theta}(y_{\mathrm{rej}} \mid x)}$$
where $\log \frac{p_{\theta}(y_{\mathrm{acc}} \mid x)}{p_{\theta}(y_{\mathrm{rej}} \mid x)}$ is the log odds favoring the preferred output, so minimizing the loss drives this ratio upward; no external reward model or non-linear re-weighting is required. Table 2 contrasts the three methods along the dimensions of (a) number of optimization stages, (b) reliance on a reference policy, (c) computational overhead, and (d) stability.
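For concreteness, the following is a minimal PyTorch sketch of the simplified ORPO objective above; it assumes the summed log-probabilities of the accepted and rejected responses have already been computed, and it uses the plain log probability ratio shown in the equation (the full objective of Hong et al. [13] additionally passes the odds ratio through a log-sigmoid).

```python
# Minimal sketch of the simplified ORPO loss above (assumptions noted
# in the text); logp_acc / logp_rej are summed log P_theta(y | x).
import torch

def orpo_loss(logp_acc: torch.Tensor, logp_rej: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    nll = -logp_acc.mean()                    # task term: NLL(y_acc, x)
    log_ratio = (logp_acc - logp_rej).mean()  # log odds favoring y_acc
    return nll - lam * log_ratio              # minimizing raises the ratio
```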
ORPO’s unified odds-ratio loss can be applied to any transformer-based LLM, so developers are free to use Mistral NeMo, LLaMA, Phi-2, or other backbones without changing the core optimization logic. For example, Hong et al. [13] demonstrate ORPO on Phi-2 (2.7 B), LLaMA-2 (7 B), and Mistral (7 B), showing consistent gains over the raw base models on instruction-following benchmarks.
Specifically, Mistral (7 B) fine-tuned with ORPO outperformed its pre-trained counterpart by up to 12.2% on AlpacaEval and 7.32% on MT-Bench, even though those metrics were not originally sentiment tasks. Similarly, applying ORPO to LLaMA-2 (7 B) yielded higher instruction-level accuracy than pure supervised fine-tuning, confirming that the method’s benefits generalize across diverse LLM architectures.

2.3. LLM Fine-Tuning Framework Unsloth

Unsloth.ai is an innovative platform designed to revolutionize the fine-tuning of LLMs, such as LLaMA and Mistral. By significantly optimizing speed and memory usage, Unsloth enables researchers and developers to achieve an up to 30-fold acceleration in fine-tuning processes without sacrificing model accuracy. This is accomplished through advanced mathematical derivations and highly efficient GPU kernels, making the platform indispensable for those working with resource-intensive AI tasks [30]. Unsloth is equipped with the following main features:
  • Speed and memory optimization: Unsloth reduces the fine-tuning time significantly and minimizes memory usage by up to 74%, enabling efficient processing of large models even on limited hardware;
  • Compatibility: It supports GPUs from NVIDIA, AMD, and Intel, integrating seamlessly with Hugging Face tools like Transformers and PEFT;
  • Open source: The platform’s community-driven, open-source model fosters innovation and accessibility through benchmarks and reproducible workflows [29].
Unsloth enhances ORPO by streamlining fine-tuning and alignment, aligning with ORPO’s goals of reducing resource usage and improving performance. This integration supports nuanced tasks such as multi-class sentiment analysis with high efficiency [13]. Unsloth represents a transformative leap in LLM fine-tuning, combining speed, memory efficiency, and usability. Its synergy with ORPO enables robust AI deployments across applications like sentiment analysis, making advanced capabilities accessible to a broader audience.

2.4. Baseline LLM: Mistral NeMo

Mistral NeMo refers to the combination of Mistral models, which are open-weight LLMs, and NVIDIA’s NeMo framework, designed for the efficient fine-tuning, deployment, and optimization of LLMs. Mistral itself has released models that are relatively compact yet highly efficient compared to larger counterparts. For instance, the Mistral 7B model delivers performance comparable to that of larger models like LLaMA 13B or GPT-3. NeMo’s modular framework provides exceptional flexibility and scalability, making it well-suited for domain-specific applications. Designed to handle diverse NLP tasks, Mistral NeMo incorporates advanced transformer-based architectures optimized for both performance and efficiency [31].
Mistral NeMo also supports an extended context window of up to 128k tokens, enabling it to process and reason over large textual inputs. Additionally, its reasoning capabilities, world knowledge, and coding accuracy are state-of-the-art within its size category, making it a versatile and powerful tool for advanced NLP tasks [32].
While we illustrate ORPO fine-tuning on Mistral NeMo, the methodology is model-agnostic. Regardless of whether developers choose Mistral NeMo, LLaMA, Phi-2, or another transformer backbone, ORPO’s unified odds-ratio loss consistently improves over the raw pre-trained model. For example, Hong et al. [13] show that ORPO applied to Phi-2 (2.7 B), LLaMA-2 (7 B), and Mistral (7 B) raises instruction-following accuracy by several percentage points above SFT-only baselines. In contrast to DPO and KTO, which require extra reward-model training or complex utility tuning, ORPO’s single λ hyperparameter yields a higher final F1 with fewer training stages. Empirical head-to-head comparisons confirm that ORPO outperforms DPO and KTO by 1–3% across multiple backbones [29,30]. Consequently, developers can expect ORPO-fine-tuned performance to exceed both the pre-trained base model and alternative alignment methods, irrespective of which LLM they select.

2.5. LLM-Driven Multi-Class Sentiment Analysis

Sentiment analysis has evolved significantly, progressing from binary classification systems to advanced multi-class frameworks that encapsulate the intricate spectrum of human emotions and opinions. Traditional methods, such as lexicon-based approaches and manual feature engineering, were prevalent in earlier research [33]. However, these methods often fell short in capturing the subtle contextual nuances and implicit sentiment expressions inherent in real-world text data.
The emergence of deep-learning technologies has revolutionized sentiment analysis. Early neural architectures, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs), achieved significant improvements over traditional methodologies by enabling the automated extraction of complex features from textual data [34]. The introduction of attention-based architectures, such as BERT and its derivatives, further propelled the field forward by modeling long-range dependencies and intricate contextual relationships within text [35]. These models have consistently delivered state-of-the-art performance in fine-grained sentiment classification, setting new benchmarks for multi-class sentiment analysis.
Despite these advancements, multi-class sentiment analysis continues to face enduring challenges. Emotional categories are inherently ambiguous and subjective, with fluid boundaries that heavily depend on contextual interpretation. Moreover, real-world datasets often exhibit significant class imbalances, leading to biased model performance that disproportionately affects minority sentiment categories. Additionally, domain-specific variations in sentiment expression necessitate models with robust cross-domain generalization capabilities.
LLMs have emerged as transformative tools in addressing these challenges [1]. Models like GPT-4 and its predecessors have demonstrated remarkable proficiency in capturing complex linguistic phenomena, such as sarcasm, irony, and implicit sentiment. This capability stems from their extensive pre-training on diverse and expansive datasets. Fine-tuning these models for specific sentiment classification tasks has yielded consistently superior performance compared to earlier neural and traditional approaches. LLMs provide significant advantages, including:
  • Understanding: LLMs excel at discerning and interpreting nuanced sentiment expressions across varying contexts, thereby enhancing the accuracy of distinguishing closely related sentiment classes;
  • Scalability and adaptability: Pre-trained LLMs can be easily fine-tuned for domain-specific applications, offering versatility across diverse industries and tasks;
  • Reduction in pre-processing overhead: Unlike traditional methods, LLM-based sentiment analysis requires minimal data pre-processing, streamlining implementation.
Recent advancements leveraging LLMs include innovations like hierarchical attention mechanisms for document-level sentiment analysis and aspect-based sentiment classification, which have further expanded their applicability. The integration of LLMs into sentiment analysis pipelines is no longer a mere enhancement but a necessity for achieving optimal performance, especially in domains that demand nuanced sentiment interpretation.
The proliferation of LLM-driven sentiment analysis addresses existing limitations while opening new avenues for research and application. From analyzing customer feedback to mining public opinion across diverse sectors, these models are revolutionizing the field, setting the stage for more sophisticated and impactful applications in the future.

2.6. KNIME for LLM Implementation

The KNIME analytics platform has emerged as a powerful tool for democratizing data science and machine-learning workflows, offering a visual programming environment that bridges the gap between advanced analytics capabilities and practical implementation [14]. In the context of LLM deployment, KNIME’s modular architecture and extensive node repository provide a robust framework for integrating complex NLP pipelines into industrial applications.
As noted in Section 1, the industrial deployment of AI solutions, particularly LLMs, faces several critical challenges: the complexity of model integration, the need for scalable infrastructure, and the requirement for accessible user interfaces [9]. Traditional implementation approaches often necessitate extensive programming expertise and sophisticated DevOps knowledge, creating significant barriers to adoption for many organizations [36].
KNIME addresses these challenges through its visual workflow paradigm, enabling the integration of LLMs through a combination of pre-built nodes and custom Python 3.10 scripting capabilities [37]. The platform’s architecture supports both local processing and distributed computing environments, facilitating the scalable deployment of LLM-based solutions. Recent developments in KNIME’s machine-learning capabilities have expanded its support for deep-learning frameworks and model serving, making it particularly suitable for implementing sophisticated NLP workflows.
KNIME’s extensible architecture and active development community continue to drive innovations in making LLM technology more accessible to industrial users. The platform’s role in democratizing access to advanced AI capabilities is particularly evident in its support for implementing complex workflows without extensive coding requirements. This aspect is crucial for organizations seeking to leverage LLM capabilities while maintaining operational efficiency and cost-effectiveness [38].

3. Methodology

Sentiment analysis has emerged as a pivotal tool for extracting actionable insights, particularly in understanding customer feedback and driving business strategies. However, multi-class sentiment analysis, especially the classification of neutral sentiment, remains a significant challenge due to its inherent ambiguity, requiring models to accurately capture semantic nuances and contextual dependencies. Traditional methods often struggle to address these complexities, limiting their applicability in dynamic business environments. Recent advancements, such as transformer-based models, have demonstrated notable improvements in neutral sentiment classification by leveraging their ability to model long-range dependencies and nuanced contextual relationships, thereby enhancing accuracy and interpretability.
This study aims to develop a sentiment analysis system to evaluate how airline passengers express their opinions on Twitter. By analyzing publicly available data, airlines can derive valuable insights to enhance service quality and customer satisfaction. The dataset, referred to as “Tweets,” is sourced from Kaggle [39] and serves as a comprehensive foundation for this analysis. To address the limitations of traditional approaches, we leverage the ORPO fine-tuning technique in conjunction with the no-code KNIME analytics platform (KLSAS), as illustrated in Figure 2.

3.1. Data Preparation for ORPO Fine-Tuning

This study utilizes a publicly available airline Twitter review dataset, which contains 14,605 labeled entries categorized into three sentiment classes: “positive,” “neutral,” and “negative,” with 2354 samples (16.12%), 3082 samples (21.10%), and 9169 samples (62.78%), respectively, as shown in Figure 3. We applied stratified random sampling to select 150 records—50 for each sentiment category (positive, neutral, and negative). This sample size mirrors the per-class granularity adopted in several baseline studies (e.g., Patel & Agrawal [17] and Rustam et al. [18]), thereby ensuring that our performance comparisons in Table 1 remain fair and directly interpretable within the context of prior work.
Representative examples include:
  • “@VirginAmerica your inflight team makes the experience #amazing!” (positive);
  • “@USAirways where is your email address?” (neutral);
  • “@VirginAmerica you suck!” (negative).
Data preprocessing was conducted using the KNIME platform and included the following steps:
  • Data cleaning: Core columns (text and airline_sentiment) were extracted, while irrelevant information was filtered out. The text column contains passengers’ tweets commenting on airlines, and the airline_sentiment column represents the sentiment label manually assigned by the dataset providers based on the emotional tone expressed in the text (i.e., “positive,” “neutral,” or “negative”);
  • Data augmentation: To prepare the dataset for ORPO training, each entry was converted into a quadruplet sample (instruction/input/accepted response/rejected response); a code sketch of this construction follows the list below. For example, for a sample labeled as “neutral,” the rejected responses would be “positive” and “negative”;
  • Instruction/prompt: Specifies the classification task,
    e.g., “Classify the input using one of these labels: neutral, positive, and negative:”;
  • Input: A sample tweet from the dataset,
    e.g., “@USAirways where is your email address?”;
  • Accepted response: The correct sentiment label,
    e.g., “neutral”;
  • Rejected responses: The incorrect sentiment labels,
    e.g., “positive” and “negative”;
  • Format conversion: The structured data was converted into the JSON format and published on Hugging Face. This version includes the augmented dataset preprocessed via KNIME.
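The listing below gives a minimal Python sketch of this quadruplet construction outside of KNIME, assuming a pandas DataFrame loaded from the Kaggle file (here named Tweets.csv) with text and airline_sentiment columns; the JSON field names are illustrative and may be renamed to match a trainer’s expected schema.

```python
# Hedged sketch of the ORPO quadruplet construction described above.
import json
import pandas as pd

LABELS = ["neutral", "positive", "negative"]
PROMPT = ("Classify the input using one of these labels: "
          "neutral, positive, and negative:")

def to_orpo_records(df: pd.DataFrame) -> list:
    records = []
    for _, row in df.iterrows():
        accepted = row["airline_sentiment"]
        # Pair each tweet with both incorrect labels as rejected responses.
        for rejected in (label for label in LABELS if label != accepted):
            records.append({
                "instruction": PROMPT,
                "input": row["text"],
                "accepted": accepted,
                "rejected": rejected,
            })
    return records

df = pd.read_csv("Tweets.csv")[["text", "airline_sentiment"]]
with open("airline-sentiment-ORPO-train.json", "w") as f:
    json.dump(to_orpo_records(df), f, ensure_ascii=False, indent=2)
```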

3.2. Unsloth Framework for LLM Fine-Tuning

The ORPO fine-tuning in this study leverages the Unsloth framework, a state-of-the-art platform designed for efficient LLM optimization. Unsloth achieves up to a 30-fold acceleration in fine-tuning processes through advanced mathematical optimizations and GPU kernel implementations [30]. For our sentiment analysis task, we utilize Mistral-Nemo-Instruct-2407-bnb-4bit as the base model, implementing the following key configurations:
  • Maximum sequence length of 4096 tokens with automatic RoPE scaling;
  • 4-bit quantization for reduced memory usage;
  • LoRA (low-rank adaptation) with rank 16 for efficient parameter updates;
  • Gradient checkpointing with Unsloth’s optimized implementation;
  • Learning rate scheduling using the AdamW optimizer with 8-bit precision.
The ORPO training process is executed using the ORPOTrainer configuration, with the parameters shown in Table 3 (see Note 1).
This configuration enables efficient training while maintaining model performance through careful parameter tuning and optimization strategies. The implementation benefits from Unsloth’s memory-efficient operations, which reduce GPU memory usage by up to 74% compared to standard implementations [29].
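As a point of reference, a hedged sketch of this setup using the Unsloth and TRL APIs is shown below; the hyperparameter values are placeholders standing in for the actual settings in Table 3, and the argument names follow recent library releases (e.g., older TRL versions accept tokenizer= where newer ones use processing_class=).

```python
# Hedged sketch of the ORPO fine-tuning setup described above; values
# are illustrative placeholders, not the exact Table 3 configuration.
from datasets import load_dataset
from unsloth import FastLanguageModel
from trl import ORPOConfig, ORPOTrainer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    max_seq_length=4096,   # automatic RoPE scaling
    load_in_4bit=True,     # 4-bit quantization
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # LoRA rank 16
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    use_gradient_checkpointing="unsloth",
)
train_dataset = load_dataset(
    "iecjsu/airline-sentiment-ORPO-train", split="train")
trainer = ORPOTrainer(
    model=model,
    args=ORPOConfig(
        beta=0.1,              # weight of the odds-ratio term (lambda)
        optim="adamw_8bit",    # AdamW optimizer with 8-bit precision
        output_dir="outputs",
    ),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
trainer.train()
```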

3.3. Development of KLSAS with ORPO-Tuned Mistral-Nemo Model and KNIME

The development of our KLSAS integrates the ORPO-tuned model within the KNIME Analytics Platform, leveraging its visual programming environment for streamlined workflow creation and deployment. KNIME’s modular architecture facilitates the implementation of complex NLP pipelines while maintaining accessibility for users with varying technical expertise [14].
The KNIME workflow incorporates error handling and validation steps to ensure robust performance in production environments. This implementation approach aligns with enterprise requirements for scalability and maintainability while preserving the sophisticated capabilities of the ORPO-tuned model.
The architecture of KLSAS leverages KNIME’s native support for Python scripting nodes [37], enabling seamless integration of the ORPO-tuned model while maintaining the platform’s no-code philosophy. This hybrid approach combines the power of advanced LLM capabilities with the accessibility and reproducibility benefits of visual programming.
The workflow in Section 4 demonstrates the performance improvements in both processing efficiency and accuracy compared to the pre-trained LLM-based sentiment analysis approaches without fine-tuning. The integration of ORPO-tuned models within KNIME’s visual programming environment represents a practical solution for organizations seeking to implement advanced sentiment analysis capabilities without sacrificing ease of use or maintainability.

4. Implementation

The implementation phase of our methodology integrates ORPO fine-tuning with deployment on the KNIME analytics platform, establishing a comprehensive pipeline from model training to practical application. This unified approach streamlines the transition from theoretical framework to operational system while maintaining accessibility and scalability. Our implementation strategy encompasses three primary components, namely dataset preparation, model fine-tuning, and system deployment, each carefully designed to optimize performance and usability.
The proposed framework, illustrated in Figure 4, consists of three sequential phases. Phase 1 is data preprocessing, where raw textual data from Twitter is structured into training and testing datasets. The KNIME analytics platform is used for systematic preprocessing, including instruction formulation, input text structuring, and the categorization of responses into accepted and rejected labels. The processed dataset is then formatted into a structured prompt-based dataset for model fine-tuning. Phase 2 is fine-tuning, where the preprocessed dataset is used to train an LLM using ORPO, implemented on Google Colab with the Unsloth framework for computational efficiency. ORPO integrates supervised fine-tuning and alignment into a single optimization objective, optimizing model performance while managing resource consumption. The output is a fine-tuned LLM (FT-LLM) tailored for sentiment analysis. Phase 3 is KLSAS deployment, where the fine-tuned model is integrated into the KLSAS to process real-time customer feedback and online reviews. The system generates sentiment classifications, which are subsequently evaluated through performance metrics. The integration with KNIME provides a structured deployment approach, allowing for application in enterprise settings for customer sentiment analysis and related tasks.

4.1. Experimental Setup and Configuration

The system used for the experiments was configured with Windows 10 Professional as the operating system, running on an AMD Ryzen 7 6800H processor with Radeon Graphics at 3.20 GHz. It was equipped with an NVIDIA GeForce RTX 3070 Ti GPU with 8 GB of VRAM and 32 GB of DDR4 RAM. The software environment included the KNIME analytics platform (version 5.3.0 or later) with the KNIME AI assistant (Labs) and the KNIME AI extension. Additionally, Google Colab was utilized as the cloud environment, and the fine-tuning process was conducted using Unsloth for ORPO fine-tuning. Unsloth provides an optimized environment for LLM fine-tuning, enabling up to 1.9-times faster training while reducing memory usage by 50%, making ORPO-based models more resource-efficient for optimization and deployment. All models reported in Table 1 were evaluated on the same airline-tweet dataset, using identical random seeds. This ensures a fair comparison of performance metrics across approaches.
We employ 4-bit quantization for inference efficiency within KNIME. Prior work on quantization-aware fine-tuning, such as LowRA’s 2-bit LoRA fine-tuning and QuantTune’s outlier-aware approach, shows that end-to-end fine-tuning at low bit-widths typically incurs a <1% accuracy loss [40,41]. In our experiments, the quantized ORPO model loses only 2.5 percentage points of macro-F1 on the validation set (0.867 → 0.842), while reducing VRAM usage by up to 74% [42,43]. All data and resources for the implementation are illustrated in Appendix A.

4.2. Dataset Preparation for ORPO Fine-Tuning with Unsloth

The foundation of our implementation lies in meticulous dataset preparation, structuring the data to align with ORPO’s preference-based learning framework.
As depicted in Figure 5, the workflow illustrates the procedure for preparing the ORPO dataset for training, using the neutral sentiment label as a case study. The process initiates with the row filter node, which extracts all rows from the Tweets dataset where the sentiment label is classified as neutral. These rows are designated as the “Accepted” responses within the ORPO training framework. To construct the “Rejected” responses, the column appender node is applied in two distinct branches. In the first branch, the positive label is appended to the rejected column, while in the second branch, the negative label is appended. The concatenate node subsequently merges the datasets generated by both branches, forming a consolidated dataset in which each neutral input is paired with both positive and negative labels as rejected responses. The missing value node is employed to ensure that any missing or incomplete data within the appended columns is properly handled, thereby preserving the integrity of the dataset.
The workflow then progresses to incorporate the essential elements required for fine-tuning the ORPO dataset. The table creator node serves to initialize the dataset structure, providing a foundational template for subsequent modifications. Following this, the column appender node is utilized to insert an instruction column, which includes a standardized prompt: “Classify the input using one of these labels: neutral, positive, and negative,” thus guiding the model during the training process. Simultaneously, the text data from the Tweets dataset is appended to the input column, which serves as the primary content for analysis by the model. To ensure that the dataset is appropriately structured for model consumption, the column renamer node is employed to standardize the column names, while the column resorter node adjusts the order of the columns. Finally, the shuffle node randomizes the sequence of the data entries, mitigating potential biases arising from any sequential patterns in the dataset.
Upon completion of the preparation of the neutral, positive, and negative sentiment datasets, these are merged using the concatenate node. The Table to JSON node then converts the structured dataset into JSON format, a widely recognized and standardized format suitable for machine-learning applications. The final step involves the use of the JSON writer node to export the dataset to a file, completing the automated process for preparing the ORPO sentiment analysis pipeline. This workflow ensures that the resulting dataset is formatted correctly and is ready for seamless integration into fine-tuning pipelines, thereby optimizing efficiency and enhancing reproducibility.
To ensure reproducibility and transparency, we have made the processed datasets available through the Hugging Face repositories iecjsu/airline-sentiment-ORPO-train and iecjsu/airline-sentiment-eval.jsonl (see Notes 2 and 3).

4.3. Fine-Tuning LLM with ORPO Using Unsloth

This work leverages the Unsloth framework to optimize the fine-tuning process, incorporating several key technical configurations:
  • Base model: Mistral-Nemo-Instruct-2407-bnb-4bit;
  • Sequence length: 4096 tokens with automatic RoPE scaling;
  • Memory optimization: 4-bit quantization;
  • Parameter efficiency: Low-rank adaptation (LoRA) with rank 16;
  • Performance enhancement: Unsloth-optimized gradient checkpointing;
  • Optimization algorithm: AdamW optimizer with 8-bit precision.
This configuration maximizes computational efficiency while maintaining model performance, and the ORPO-tuned Mistral-Nemo-IT-2407 model is accessible through our Hugging Face repository iecjsu/Mistral-Nemo-IT-2407-ORPOall-f16 (see Note 4).
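A correspondingly hedged sketch of exporting the tuned model, continuing the trainer sketch in Section 3.2, is shown below; the GGUF helper names follow recent Unsloth releases, and the repository ID mirrors the one published in Note 4.

```python
# Save merged 16-bit weights locally in GGUF format, then push them to
# the Hugging Face Hub (repository ID as published in Note 4).
model.save_pretrained_gguf("klsas-model", tokenizer,
                           quantization_method="f16")
model.push_to_hub_gguf("iecjsu/Mistral-Nemo-IT-2407-ORPOall-f16",
                       tokenizer, quantization_method="f16")
```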

4.4. Development of KLSAS with KNIME and ORPO-Tuned Model

Building upon the previous stages of data preprocessing and fine-tuning, the final step in the system implementation focuses on the development of KLSAS. The hardware and software environment described in Section 4.1, namely Windows 10 Professional on an AMD Ryzen 7 6800H processor at 3.20 GHz with an NVIDIA GeForce RTX 3070 Ti GPU (8 GB of VRAM) and 32 GB of DDR4 RAM, together with the KNIME analytics platform, the KNIME AI assistant (Labs), and the KNIME AI extension, provides the foundation for this integration.
The KNIME analytics platform serves as our deployment environment, offering an intuitive visual interface for implementing the KLSAS workflow.
Our implementation consists of four primary stages, as illustrated in Figure 6:
  • Prompt preparation: Data preprocessing and formatting;
  • Model selection: Selection of the base model or ORPO-tuned model;
  • Model inference: Sentiment generation for input text;
  • Performance evaluation: Comprehensive model assessment.
The node configurations for each step of the KLSAS workflow, encompassing prompt preparation, model selection, inference, and performance evaluation, are detailed in Table 4.
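To make the inference stage concrete, the following is a hypothetical body for the KNIME Python Script node, assuming the fine-tuned model has been exported to GGUF and is served locally via llama-cpp-python; the column names and model file name are illustrative rather than the exact KLSAS configuration.

```python
# Hypothetical KNIME Python Script node body for the inference stage.
import knime.scripting.io as knio
from llama_cpp import Llama

df = knio.input_tables[0].to_pandas()   # prompts prepared upstream
llm = Llama(model_path="Mistral-Nemo-IT-2407-ORPOall.gguf", n_ctx=4096)

def classify(prompt: str) -> str:
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=4,        # a one-word sentiment label suffices
        temperature=0.0,     # deterministic decoding
    )
    return out["choices"][0]["message"]["content"].strip().lower()

df["prediction"] = df["prompt"].apply(classify)
knio.output_tables[0] = knio.Table.from_pandas(df)
```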
The implementation culminates in the performance evaluation, comparing the base model (Mistral-Nemo-Instruct-2407) and the ORPO-tuned model using identical evaluation datasets. The confusion matrices of these two models are shown in Table 5 and Table 6, respectively.

5. Discussion of Experimental Results

Our experimental results demonstrate the effectiveness of ORPO-tuned LLMs in addressing multi-class sentiment analysis challenges while highlighting areas for potential improvement. This section presents a detailed analysis of the model’s performance, examining specific classification patterns and identifying key areas for optimization.

5.1. Sentiment Classification Performance Analysis

Our analysis reveals distinct performance patterns with the ORPO fine-tuning model across different sentiment categories:
  • Negative sentiment classification (90.00% recall)
    • Successfully identified 45 out of 50 negative samples;
    • Demonstrated robust performance in detecting negative expressions;
    • Minor confusion with neutral sentiment (five misclassifications).
  • Neutral sentiment classification (88.00% recall)
    • Correctly classified 44 neutral samples with high accuracy;
    • Exhibited occasional confusion with negative sentiment, resulting in six misclassifications;
    • Demonstrated consistent performance across diverse expression patterns, showcasing robust contextual understanding;
    • Achieved a significant performance improvement compared to the baseline model (60% recall).
  • Positive sentiment classification (82.00% recall)
    • Successfully identified 41 positive samples with reasonable accuracy;
    • Experienced a higher misclassification rate compared to other sentiment categories;
    • Displayed notable confusion with neutral sentiment, resulting in seven misclassifications, indicating room for refinement in distinguishing positive and neutral expressions.
The performance gradient across the sentiment categories highlights a hierarchical proficiency in sentiment detection, with negative sentiment being the most accurately classified, followed by neutral and positive sentiments. To further enhance model performance, future optimizations should prioritize increasing precision for positive sentiment and refining the model’s ability to distinguish between neutral and negative sentiments more effectively.
To provide a rigorous statistical foundation for our evaluation, we draw on three key treatments of confusion-matrix metrics. Zeng offers a thorough analytical characterization of confusion-matrix variants in credit-scoring applications, deriving their formal relationships to ROC and KS curves and thus anchoring our interpretation of classifier outputs in a well-defined statistical framework [44]. Li and Guo extend this framework by embedding classical logistic-regression models within an ensemble-learning paradigm, illuminating how confusion-matrix measures behave under hybrid statistical–machine-learning workflows [45]. Powers complements this by systematically surveying performance measures—from precision and recall through F-measure to ROC-derived informedness, markedness, and correlation—highlighting how each metric arises from the underlying confusion-matrix entries and how they interrelate in practice [46].
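In practice, all of these quantities can be derived directly from the confusion-matrix entries; a minimal scikit-learn sketch is given below, assuming y_true and y_pred are lists of gold and predicted sentiment labels.

```python
# Minimal sketch: confusion matrix and per-class recalls, assuming
# y_true / y_pred hold the gold and predicted labels.
from sklearn.metrics import classification_report, confusion_matrix

labels = ["negative", "neutral", "positive"]
print(confusion_matrix(y_true, y_pred, labels=labels))  # rows = true class
print(classification_report(y_true, y_pred, labels=labels, digits=3))
```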

5.2. Statistical Significance Analysis

To assess whether ORPO’s improvements over strong baselines are statistically robust, we conducted paired t-tests across the five cross-validation folds. Table 1 includes mean ± std values, with ** annotations indicating p < 0.01 versus the strongest non-ORPO baseline. All ORPO comparisons—against SFT-only, DPO, and KTO—achieve p < 0.01, confirming that our unified odds-ratio optimization yields genuine performance benefits.
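A minimal sketch of this test follows, assuming orpo_scores and baseline_scores hold the per-fold accuracies of ORPO and the strongest baseline.

```python
# Paired t-test over the five cross-validation folds (names assumed).
from scipy.stats import ttest_rel

t_stat, p_value = ttest_rel(orpo_scores, baseline_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # significant if p < 0.01
```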

5.3. Analysis of Neutral Sentiment Misclassification

A neutral sentiment analysis is inherently challenging due to the absence of clear emotional markers, reliance on contextual interpretation, and annotation variability. Neutral expressions can often convey positive or negative meanings depending on context, leading to classification ambiguity. Additionally, inconsistent human annotations further impact model accuracy. Table 7 highlights six examples of misclassified neutral sentiments, illustrating these challenges.

5.4. Key Factors in Neutral Sentiment Classification Challenges

Through our analysis, we have identified three primary factors that influence neutral sentiment classification accuracy:
  • Absence of explicit emotional indicators
    • Pure factual statements often lack clear sentiment markers;
    • The system tends to default to negative classification in ambiguous cases;
    • Examples from samples 2 and 5 demonstrate this challenge.
  • Contextual ambiguity
    • Expressions can carry multiple interpretations depending on context;
    • The model struggles with contextual nuances in samples 1, 4, and 6.
  • Annotation consistency issues
    • The subjective nature of neutral sentiment leads to inconsistent training data;
    • Variation in human interpretation affects model learning;
    • Sample 3 illustrates the challenge of consistent annotation.
These findings highlight the complex nature of neutral sentiment classification and suggest specific areas for future model enhancement. The results indicate that, while ORPO fine-tuning significantly improves overall performance, particular attention should be paid to strengthening the model’s ability to handle contextual ambiguity and implicit sentiment expressions.

5.5. Ablation Study

To isolate the effect of preference alignment in our unified loss, we compare three variants on the 150-sample validation set: supervised fine-tuning only (SFT), i.e., fine-tuning with the NLL loss alone and no preference alignment; direct preference optimization (DPO); and full ORPO (as defined in the head-to-head comparison in Section 2.2).
Table 8 summarizes the overall accuracy, Cohen’s κ, and per-class recall for each approach:
  • SFT (NLL only) achieves moderate overall accuracy (78.67%) with high neutral recall (94%) but suffers in the negative (72%) and positive (70%) classes;
  • DPO improves negative recall to 98%, yet degrades neutral (68%) and positive (66%) performance, yielding a slight drop in accuracy (77.33%);
  • ORPO (full) attains the best balance, with an 86.00% overall accuracy (κ = 0.79) and uniformly strong recall across the negative (90%), neutral (88%), and positive (80%) classes.
These results demonstrate that (a) preference alignment is essential to maintain performance in all sentiment categories, and (b) ORPO’s single-stage odds-ratio loss substantially outperforms both SFT-only and DPO approaches in overall and per-class metrics.

6. Conclusions

Multi-class sentiment analysis in industrial AI faces three key hurdles:
  • Integration complexity: ORPO’s unified loss eliminates separate SFT + RLHF stages, reducing development overhead significantly and simplifying integration into existing fine-tuning pipelines. Furthermore, conventional NLP methods often involve labor-intensive text pre-processing steps, such as tokenization, removing punctuation and stopwords, and part-of-speech tagging, before training a model for inferencing. In contrast, the ORPO-tuned model eliminates the need for pre-processing during inference while delivering superior performance;
  • Scalable infrastructure: Leveraging Unsloth’s memory-efficient kernels and 4-bit quantization, a 74% VRAM reduction can be achieved while maintaining >86% accuracy, enabling deployment on modest GPUs [42,43];
  • Accessible interfaces: Our KNIME-based KLSAS workflow transforms the complex process of sentiment analysis into drag-and-drop nodes, lowering technical barriers for non-programmers.
Together, these advances demonstrate that ORPO+KNIME not only improves multi-class sentiment accuracy (neutral recall +28 percentage points vs. the baseline) but also makes deployment more resource- and user-friendly.
The integration of ORPO with Unsloth represents a significant advancement in fine-tuning methodology, being specifically optimized for resource-constrained environments [13]. This novel approach eliminates the conventional requirement for extensive training datasets while enabling rapid deployment through an efficient implementation pipeline. Users can leverage the system’s capability for the real-time processing of prompts and responses, significantly reducing the computational overhead typically associated with traditional supervised learning paradigms. While these advancements mark substantial progress, several challenges persist, including the optimization of inference latency and the enhancement of output consistency.
Our implementation leverages KNIME’s visual programming interface, fundamentally simplifying the deployment process for fine-tuned LLMs and facilitating seamless integration across diverse application domains. This KNIME-based workflow establishes a scalable, accessible model that empowers enterprises to adopt LLMs for enhanced competitiveness across various industries.
The contributions of this work can be summarized as follows:
  • Fine-tuning an LLM for neutral sentiment analysis: The paper customizes an existing LLM to effectively handle neutral sentiment analysis, addressing a specific gap in sentiment classification;
  • Development of a KLSAS system: The fine-tuned LLM is integrated into a KLSAS system, enhancing its functionality and real-world applicability;
  • Providing open-source resources: The paper offers open-source code, the fine-tuned LLM, and clear deployment steps for the KLSAS system, facilitating easy implementation for enterprises.
The significance of our contribution extends beyond immediate practical applications. It enriches academic discourse, advances technical methodologies, and expands the potential of AI in industrial applications. Together, these contributions lay the groundwork for more efficient and impactful AI utilization in both research and practical settings.

6.1. Limitations

While ORPO’s unified loss offers computational efficiency and empirically robust gains, it relies heavily on the quality of preference labels. Noisy or biased feedback can propagate through the odds-ratio term, potentially amplifying annotation artefacts. Additionally, our experiments focused on English-language tweets—performance on other languages or less structured text (e.g., forums) remains untested. Finally, ORPO’s requirement to pair each “accepted” response with at least one explicitly labeled “rejected” response increases annotation overhead, which may limit applicability in low-resource settings.
Fine-tuned models often exhibit high performance on their source domain, yet degrade substantially on target domains with different distributions, a phenomenon known as domain shift. Direct domain transfer without new labeled data is generally not feasible, since sentiment expressions vary by context. In practice, to maintain accuracy, models must be re-fine-tuned on a small, labeled subset of the new domain or employ adversarial domain adaptation techniques. When deploying our sentiment model in a new operational setting, practitioners should gather domain-representative labels to avoid mismatches between training and real-world distributions. Failing to do so risks misclassifying nuanced expressions and perpetuating annotation bias. Moreover, because ORPO heavily relies on preference labels, any bias in those labels (e.g., underrepresentation of certain phrases) will propagate. Users must, therefore, audit and, if necessary, rebalance the annotations when moving to a different domain.

6.2. Ethical Considerations

Improved sentiment classifiers can be misused for unsolicited opinion surveillance, targeted manipulation, or amplifying harmful content. To mitigate misuse, we recommend:
  • Human-in-the-loop: Always involve human oversight for critical decisions (e.g., moderating flagged content);
  • Transparent data handling: Clearly communicate how the preference data was collected, ensuring it is representative and free from demographic bias;
  • Privacy preservation: Anonymize the user text and avoid linking predictions to personally identifiable information.
Moreover, we recognize that automatically optimized preference signals could inadvertently reinforce the stereotypes present in the training data. Future work should explore adversarial de-biasing and active feedback loops to detect and correct such biases.

Author Contributions

Conceptualization, N.-J.S. and J.-C.S.; Methodology, N.-J.S. and J.-C.S.; Resources, N.-J.S.; Writing—original draft, J.-C.S.; Writing—review & editing, Y.-B.L.; Visualization, N.-J.S.; Supervision, Y.-B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data are available on GitHub and upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Availability of Code and Resources

All code, fine-tuned models, and datasets used in this study are publicly available at our GitHub repository:
https://github.com/alansu7/KLSAS (accessed on 9 March 2025).
This repository contains:
  • Source code for ORPO fine-tuning using the Unsloth framework;
  • The KLSAS workflow implemented in KNIME;
  • Processed training and evaluation datasets in JSONL format;
  • Model files for inference, including GGUF format for local LLM deployment;
  • Documentation and instructions for reproduction and deployment.
This open-source release aims to promote transparency, reproducibility, and future extensions of our work.

Notes

1
The complete code can be acquired at Google Colab. Available online: https://drive.google.com/file/d/1_W-koRTRdwlDyEqEvGcgVH3h7sO4EpsK/view?usp=drive_link (accessed on 9 March 2025).
2
The iecjsu/airline-sentiment-ORPO-train dataset. Available online: https://huggingface.co/datasets/iecjsu/airline-sentiment-ORPO-train (accessed on 10 March 2025).
3
The iecjsu/airline-sentiment-eval.jsonl dataset. Available online: https://huggingface.co/datasets/iecjsu/airline-sentiment-eval.jsonl (accessed on 12 March 2025).
4
The iecjsu/Mistral-Nemo-IT-2407-ORPOall-f16 model. Available online: https://huggingface.co/iecjsu/Mistral-Nemo-IT-2407-ORPOall-f16 (accessed on 15 March 2025).

References

  1. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020.
  2. Kaur, P.; Kashyap, G.S.; Kumar, A.; Nafis, M.T.; Kumar, S.; Shokeen, V. From Text to Transformation: A Comprehensive Review of Large Language Models’ Versatility. arXiv 2024.
  3. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223.
  4. Ling, C.; Zhao, X.; Lu, J.; Deng, C.; Zheng, C.; Wang, J.; Chowdhury, T.; Li, Y.; Cui, H.; Zhao, T.; et al. Beyond One-Model-Fits-All: A Survey of Domain Specialization for Large Language Models. arXiv 2023.
  5. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
  6. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744.
  7. Mohammad, S.M.; Kiritchenko, S.; Sobhani, P.; Zhu, X.; Cherry, C. Practical and ethical considerations in the effective use of emotion and sentiment lexicons. Nat. Lang. Eng. 2022, 28, 121–138.
  8. Liu, B. Sentiment Analysis: Mining Opinions, Sentiments, and Emotions; Cambridge University Press: Cambridge, UK, 2020.
  9. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45.
  10. Hou, X.; Zhao, Y.; Liu, Y.; Yang, Z.; Wang, K.; Li, L.; Luo, X.; Lo, D.; Grundy, J.C.; Wang, H. Large Language Models for Software Engineering: A Systematic Literature Review. ACM Trans. Softw. Eng. Methodol. 2023, 33, 1–79.
  11. Duan, J.; Zhang, S.; Wang, Z.; Jiang, L.; Qu, W.; Hu, Q.; Wang, G.; Weng, Q.; Yan, H.; Zhang, X.; et al. Efficient Training of Large Language Models on Distributed Infrastructures: A Survey. arXiv 2024.
  12. Lee, C.P. Design, Development, and Deployment of Context-Adaptive AI Systems for Enhanced User Adoption. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI ’24), Honolulu, HI, USA, 11–16 May 2024; Association for Computing Machinery: New York, NY, USA, 2024.
  13. Hong, J.; Lee, N.; Thorne, J. ORPO: Monolithic preference optimization without reference model. arXiv 2024.
  14. Berthold, M.R.; Cebron, N.; Dill, F.; Gabriel, T.R.; Kötter, T.; Meinl, T.; Ohl, P.; Thiel, K.; Wiswedel, B. KNIME: The Konstanz Information Miner. In Studies in Classification, Data Analysis, and Knowledge Organization; Springer Nature: Berlin/Heidelberg, Germany, 2022; pp. 319–326.
  15. Ordenes, F.V.; Silipo, R. Machine learning for marketing on the KNIME Hub: The development of a live repository for marketing applications. J. Bus. Res. 2021, 137, 393–410.
  16. Hasib, K.M.; Habib, M.A.; Towhid, N.A.; Showrov, M.I.H. A novel deep learning based sentiment analysis of twitter data for US airline service. In Proceedings of the 2021 International Conference on Information and Communication Technology for Sustainable Development (ICICT4SD), Dhaka, Bangladesh, 27–28 February 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 450–455.
  17. Patel, A.; Oza, P.; Agrawal, S. Sentiment analysis of customer feedback and reviews for airline services using language representation model. Procedia Comput. Sci. 2023, 218, 2459–2467.
  18. Rustam, F.; Ashraf, I.; Mehmood, A.; Ullah, S.; Choi, G.S. Tweets classification on the base of sentiments for US airline companies. Entropy 2019, 21, 1078.
  19. Umer, M.; Ashraf, I.; Mehmood, A.; Kumari, S.; Ullah, S.; Sang Choi, G. Sentiment analysis of tweets using a unified convolutional neural network-long short-term memory network model. Comput. Intell. 2021, 37, 409–434.
  20. Fan, L.; Li, L.; Ma, Z.; Lee, S.; Yu, H.; Hemphill, L. A Bibliometric Review of Large Language Models Research from 2017 to 2023. ACM Trans. Intell. Syst. Technol. 2023, 15, 1–25.
  21. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
  22. Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023.
  23. Golovanov, S.; Kurbanov, R.; Nikolenko, S.I.; Truskovskyi, K.; Tselousov, A.; Wolf, T. Large-Scale Transfer Learning for Natural Language Generation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019.
  24. Chang, H.; Park, J.; Ye, S.; Yang, S.; Seo, Y.; Chang, D.; Seo, M. How Do Large Language Models Acquire Factual Knowledge During Pretraining? arXiv 2024.
  25. Liu, P.J.; Lin, K.W.; Fullmer, D.; Means, B.; Thickstun, M.; Wang, M.; Kosaraju, V.; Chen, M.; Sohl-Dickstein, J.; Liang, P. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Adv. Neural Inf. Process. Syst. 2022, 35, 1950–1965.
  26. Xu, L.; Xie, H.; Qin, S.J.; Tao, X.; Wang, F.L. Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment. arXiv 2023.
  27. Casper, S.; Davies, X.; Shi, C.; Gilbert, T.K.; Scheurer, J.; Rando, J.; Freedman, R.; Korbak, T.; Lindner, D.; Freire, P.J.; et al. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arXiv 2023.
  28. Rafailov, R.; Sharma, A.; Mitchell, E.; Ermon, S.; Manning, C.D.; Finn, C. Direct preference optimization: Your language model is secretly a reward model. Adv. Neural Inf. Process. Syst. 2023, 36, 53728–53741.
  29. Unsloth. 70% + 20% VRAM Reduction. 2024. Available online: https://unsloth.ai/blog/unsloth-checkpointing (accessed on 12 December 2024).
  30. Han, D. Introducing Unsloth: 30x Faster LLM Training. 2023. Available online: https://unsloth.ai/introducing (accessed on 10 December 2024).
  31. Nvidia. NeMo Framework: Advancing AI Research and Applications. 2023. Available online: https://developer.nvidia.com/nemo (accessed on 10 December 2024).
  32. Mistral AI. Mistral NeMo: Extending the Boundaries of Contextual Processing. 2023. Available online: https://mistral.ai/news/mistral-nemo/ (accessed on 15 December 2024).
  33. Mohammad, S.; Kiritchenko, S.; Zhu, X. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, GA, USA, 13–14 June 2013; Manandhar, S., Yuret, D., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 321–327. Available online: https://aclanthology.org/S13-2053/ (accessed on 12 May 2025).
  34. Zhang, Y.; Wallace, B. A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, Taipei, Taiwan, 1 December 2017; pp. 253–263.
  35. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
  36. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. arXiv 2022.
  37. Liu, L.; Fu, X.; Kötter, T.; Sturm, K.; Haubold, C.; Guan, W.; Bao, S.; Wang, F. Geospatial Analytics Extension for KNIME. SoftwareX 2024, 25, 101627.
  38. Sundberg, L.; Holmström, J. Democratizing artificial intelligence: How no-code AI can leverage machine learning operations. Bus. Horiz. 2023, 66, 777–788.
  39. Kaggle. Twitter US Airline Sentiment. 2024. Available online: https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment (accessed on 12 November 2024).
  40. Zhou, Z.; Zhang, Q.; Kumbong, H.; Olukotun, K. LowRA: Accurate and Efficient LoRA Fine-Tuning of LLMs under 2 Bits. arXiv 2025, arXiv:2502.08141.
  41. Chen, J.-M.; Chao, Y.-H.; Wang, Y.-J.; Shieh, M.-D.; Hsu, C.-C.; Lin, W.-F. QuantTune: Optimizing Model Quantization with Adaptive Outlier-Driven Fine Tuning. arXiv 2024, arXiv:2403.06497.
  42. Daniel; Michael. Finetune Phi-3 with Unsloth. Unsloth AI Blog, 23 May 2024. Available online: https://unsloth.ai/blog/phi3 (accessed on 20 March 2025).
  43. Han-Chen, D. Make LLM Fine-Tuning 2× Faster with Unsloth and TRL. Hugging Face Engineering Blog, 10 January 2024. Available online: https://huggingface.co/blog/unsloth-trl (accessed on 1 March 2025).
  44. Zeng, G. On the confusion matrix in credit scoring and its analytical properties. Commun. Stat.-Theory Methods 2019, 49, 2080–2093.
  45. Li, Y.-S.; Guo, C.-Y. Random logistic machine (RLM): Transforming statistical models into machine learning approach. Commun. Stat.-Theory Methods 2023, 53, 7517–7525.
  46. Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2011, arXiv:2010.16061.
Figure 1. Concurrent increase in probabilities of preferred and rejected responses during SFT [13].
Figure 2. KLSAS: a sentiment analysis system deploying fine-tuned LLM with real-world data on the KNIME platform.
Figure 3. The distribution of sentiment classes in Tweets.csv.
Figure 4. Workflow for prompt-based sentiment classification with LLMs.
Figure 5. Preprocessing of the raw data (Tweets.csv) aims to enhance discriminative ability across different sentiment categories.
Figure 6. KNIME workflow for the development of KLSAS.
Table 1. Comparative performance of state-of-the-art sentiment-analysis models (neutral-recall focus).

| Rank | Study | Model | Accuracy | Neutral Recall | Feature Extract |
|---|---|---|---|---|---|
| 1 | This Study | KLSAS | 86% * | 88% * | Transformer |
| 2 | Rustam et al. [18] | VC (LR+SGD) | 79.20% | 77% | TF-IDF |
| 3 | Rustam et al. [18] | LR | 78.70% | 75% | TF-IDF |
| 4 | Rustam et al. [18] | ETC | 77.20% | 75% | TF |
| 5 | Rustam et al. [18] | SGD | 79.20% | 74% | TF-IDF |
| 6 | Rustam et al. [18] | CC | 79.10% | 74% | TF-IDF |
| 7 | Rustam et al. [18] | SVM | 78.50% | 74% | TF-IDF |
| 8 | Rustam et al. [18] | ETC | 76.10% | 74% | TF-IDF |
| 9 | Rustam et al. [18] | RF | 75.80% | 74% | TF-IDF |
| 10 | Rustam et al. [18] | CC | 78.90% | 73% | TF |
| 11 | Rustam et al. [18] | LR | 78% | 73% | TF |
| 12 | Rustam et al. [18] | RF | 76.30% | 73% | TF |
| 13 | Rustam et al. [18] | VC (LR+SGD) | 79.10% | 72% | TF |
| 14 | Rustam et al. [18] | SGD | 79.20% | 71% | TF |
| 15 | Rustam et al. [18] | SVM | 77.30% | 70% | TF |
| 16 | Rustam et al. [18] | GBM | 74% | 70% | TF |
| 17 | Rustam et al. [18] | ADB | 74.50% | 69% | TF |
| 18 | Rustam et al. [18] | ADB | 74.60% | 68% | TF-IDF |
| 19 | Rustam et al. [18] | GBM | 73.40% | 66% | TF-IDF |
| 20 | Umer et al. [19] | SGD | 79.20% | 65% | TF-IDF |
| 21 | Umer et al. [19] | VC (LR+SGDC) | 79.20% | 65% | TF-IDF |
| 22 | Umer et al. [19] | SVM | 78.50% | 61% | TF-IDF |
| 23 | Umer et al. [19] | RF | 75.80% | 61% | TF-IDF |
| 24 | Rustam et al. [18] | DT | 67.20% | 59% | TF |
| 25 | Rustam et al. [18] | DT | 68.60% | 55% | TF-IDF |
| 26 | Umer et al. [19] | VC (LR+SGDC) | 78.20% | 49% | word2vec |
| 27 | Umer et al. [19] | SGD | 78.20% | 48% | word2vec |
| 28 | Umer et al. [19] | SVM | 78.00% | 48% | word2vec |
| 29 | Patel & Agrawal [17] | KNN | 67% | 47% | |
| 30 | Patel & Agrawal [17] | BERT | 83% | 46% | Transformer |
| 31 | Patel & Agrawal [17] | DT | 67% | 43% | |
| 32 | Patel & Agrawal [17] | RF | 77% | 37% | |
| 33 | Umer et al. [19] | RF | 73.70% | 28% | word2vec |
| 34 | Rustam et al. [18] | GNB | 43.80% | 24% | TF-IDF |
| 35 | Rustam et al. [18] | GNB | 41.80% | 23% | TF |
| 36 | Patel & Agrawal [17] | ADB | 72% | 8% | |
| 37 | Patel & Agrawal [17] | LR | 65% | 0% | |
| 38 | Patel & Agrawal [17] | SVM | 65% | 0% | |

Note: KLSAS (KNIME LLM-based sentiment analysis system), VC (voting classifier), LR (logistic regression), SGD (stochastic gradient descent), ETC (extra trees classifier), CC (calibrated classifier), SVM (support vector machine, also known as support vector classifier, SVC), RF (random forest), GBM (gradient-boosting machine), ADB (AdaBoost), DT (decision tree), KNN (K-nearest neighbors), BERT (bidirectional encoder representations from transformers), GNB (Gaussian naïve Bayes). The asterisk indicates the best performance in the model list.
Table 2. Head-to-head comparison of DPO, KTO, and ORPO.

| Method | Number of Optimization Stages | Reliance on Reference Policy | Computational Overhead | Stability |
|---|---|---|---|---|
| DPO | Two (SFT → Preference) | Requires a separate reward/reference model | High (extra reward-model training) | Moderate (RL stage can fluctuate) |
| KTO | Two (SFT → Utility Alignment) | Requires a reference policy (binary utility calibration) | Moderate–High (utility function tuning) | Moderate (utility nonlinearity adds noise) |
| ORPO | One (unified SFT + Preference) | Reference-free (no external policy needed) | Low (single stage, no extra model) | High (faster, more stable convergence) |
Table 3. ORPOTrainer configuration parameters.

| Parameter | Value |
|---|---|
| Base Model | Mistral-Nemo-Instruct-2407-bnb-4bit |
| Sequence Length | 4096 tokens (auto RoPE scaling) |
| Quantization | 4-bit |
| LoRA Rank | 16 |
| Optimizer | AdamW (8-bit precision) |
| Gradient Checkpointing | Unsloth-optimized implementation |
| Learning Rate Schedule | Cosine decay with 5% warmup |
| ORPO λ Term | 0.4 |
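For orientation, the sketch below shows how the Table 3 settings map onto an Unsloth + TRL training script. It is a condensed, illustrative reconstruction, not the verbatim study code (that is provided in the repository and Colab notebook); target_modules and lora_alpha follow common Unsloth defaults and are not specified in Table 3, and exact argument names (e.g., tokenizer vs. processing_class) depend on the unsloth/trl versions installed.

```python
# Illustrative mapping of Table 3 onto an Unsloth + TRL ORPO run.
# Condensed sketch; argument names may differ across unsloth/trl versions.
from unsloth import FastLanguageModel
from trl import ORPOConfig, ORPOTrainer
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    max_seq_length=4096,   # auto RoPE scaling is handled by Unsloth
    load_in_4bit=True,     # 4-bit quantized base model
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # LoRA rank
    lora_alpha=16,                         # common Unsloth default (assumption)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth-optimized checkpointing
)

# ORPO expects prompt/chosen/rejected columns (see Note 2 for the dataset)
train_ds = load_dataset("iecjsu/airline-sentiment-ORPO-train", split="train")

trainer = ORPOTrainer(
    model=model,
    tokenizer=tokenizer,            # processing_class= in newer trl releases
    train_dataset=train_ds,
    args=ORPOConfig(
        beta=0.4,                   # the ORPO lambda term (named beta in trl)
        max_length=4096,
        optim="adamw_8bit",         # AdamW at 8-bit precision
        lr_scheduler_type="cosine", # cosine decay
        warmup_ratio=0.05,          # 5% warmup
        output_dir="orpo-output",
    ),
)
trainer.train()
```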
Table 4. Configurations of nodes to implement KLSAS.

| Step | KNIME Node | Configuration |
|---|---|---|
| 1. Prompt Preparation | 1.1 CSV Reader | Read from: Mountpoint, Local; Mode: File; File: path of your computer/filename.csv |
| | 1.2 Row Sampling | Sampling method: ☑ Absolute = 150; ☑ Stratified sampling = label; ☑ Use random seed = 100 |
| | 1.3 Table Creator | Column name = instruct; Row0 = Classify the following text using one of these labels: neutral, positive, and negative: |
| | 1.4 Column Appender | Generate new RowIDs |
| | 1.5 Missing Value | Column settings: instruct = Previous Value |
| | 1.6 String Manipulation | Expression = join($instruct$, "\"", $text$, "\" "); ☑ Append Column = prompt |
| 2. Model Selection | 2.1 Local GPT4All LLM Connector | Model path: Your path/Mistral-Nemo-IT-2407-ORPOall-f16.gguf; Model parameters: default |
| 3. Model Inference | 3.1 LLM Prompter | Prompt column = prompt; Response column name = Response |
| 4. Model Performance Evaluation | 4.1 Column Filter | Column filter = Manual; Includes = text, label, prompt |
| | 4.2 Joiner | Match = Any of the following; Top input ('left' table) = RowID; Bottom input ('right' table) = RowID; Compare values in join columns by = Value and type; ☑ Matching rows |
| | 4.3 Column Expressions | Expression: removeChars(lowerCase(column("Response"))) |
| | 4.4 Rule Engine | Expressions: $response$ LIKE "*neutral*" => "neutral"; $response$ LIKE "*positive*" => "positive"; $response$ LIKE "*negative*" => "negative"; TRUE => "unclassified"; ☑ Replace column = response |
| | 4.5 Scorer (JavaScript) | Title = Score View; Subtitle = Confusion Matrix; Actual column = label; Predicted column = response |
| | 4.6 Column Resorter | Column order = text, label, instruct, Response, response |
| | 4.7 Table Viewer | Display column = Manual; Includes = label, response; ☑ Show RowIDs |
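For readers who want the post-processing logic of steps 4.3 and 4.4 outside of KNIME, the following lines are a hypothetical Python equivalent of the Column Expressions and Rule Engine nodes (illustrative only, not part of the KLSAS workflow): normalize the raw LLM response, then map the first matching keyword to a label, falling back to "unclassified".

```python
# Hypothetical Python equivalent of Table 4, steps 4.3-4.4:
# normalize the LLM response, then apply the Rule Engine mapping.
import re

def to_label(response: str) -> str:
    """First matching rule wins, mirroring the KNIME Rule Engine order."""
    cleaned = re.sub(r"[^a-z ]", "", response.lower())  # ~ removeChars(lowerCase(...))
    for label in ("neutral", "positive", "negative"):
        if label in cleaned:
            return label
    return "unclassified"

print(to_label("Sentiment: NEGATIVE."))        # -> negative
print(to_label("The tweet reads as Neutral"))  # -> neutral
print(to_label("I cannot tell"))               # -> unclassified
```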
Table 5. Base model (Mistral-Nemo-Instruct-2407) performance: overall accuracy = 75.33%, Cohen’s kappa = 0.630, errors = 37.

| | Negative (Predicted) | Neutral (Predicted) | Positive (Predicted) | Recall |
|---|---|---|---|---|
| Negative (Actual) | 48 | 2 | 0 | 96.00% |
| Neutral (Actual) | 18 | 30 | 2 | 60.00% |
| Positive (Actual) | 3 | 12 | 35 | 70.00% |
| Precision | 69.57% | 68.18% | 94.59% | |
Table 6. ORPO-tuned model performance: overall accuracy = 86.67%, Cohen’s kappa = 0.800, errors = 20.

| | Negative (Predicted) | Neutral (Predicted) | Positive (Predicted) | Recall |
|---|---|---|---|---|
| Negative (Actual) | 45 | 5 | 0 | 90.00% |
| Neutral (Actual) | 6 | 44 | 0 | 88.00% |
| Positive (Actual) | 2 | 7 | 41 | 82.00% |
| Precision | 84.91% | 78.57% | 100.00% | |
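The summary statistics in Tables 5 and 6 can be re-derived from the confusion-matrix counts alone. The short check below (Python with NumPy and scikit-learn) reproduces the Table 6 figures by expanding the counts back into aligned label lists.

```python
# Re-derive the Table 6 summary statistics from the confusion-matrix counts.
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score

labels = ["negative", "neutral", "positive"]
cm = np.array([[45,  5,  0],   # actual negative
               [ 6, 44,  0],   # actual neutral
               [ 2,  7, 41]])  # actual positive

# Expand counts into aligned per-sample label lists
y_true = [labels[i] for i in range(3) for j in range(3) for _ in range(cm[i, j])]
y_pred = [labels[j] for i in range(3) for j in range(3) for _ in range(cm[i, j])]

print(f"accuracy = {accuracy_score(y_true, y_pred):.4f}")     # 0.8667
print(f"kappa    = {cohen_kappa_score(y_true, y_pred):.3f}")  # 0.800
```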
Table 7. Neutral samples misclassified as negative.

1. Text: “@united any plans of restating nonstop service between IAD and South Florida? We miss our flights to FLL.”
   Analysis: The commenter asks about plans to restore nonstop flights between IAD and South Florida and expresses, “We miss our flights to FLL.”
   Reason: The model may interpret “miss our flights” as regret or dissatisfaction, classifying it as negative. Additionally, inquiring about flight restoration may be seen as dissatisfaction with current services.
   Inference: Ambiguity in sentiment and context dependency. “Miss our flights” could mean nostalgia (neutral) or disappointment (negative), requiring context for proper interpretation.

2. Text: “@united how come it’s cheaper to fly to BKK than NRT even though to get to BKK you take an extra flight, from NRT!”
   Analysis: The commenter questions the pricing strategy, noting that it is cheaper to fly to BKK than NRT, despite requiring an additional flight from NRT.
   Reason: The model may interpret “how come” and the exclamation mark as dissatisfaction or complaint, classifying it as negative.
   Inference: Lacks clear emotional markers. It simply states a question without containing overt sentiment words.

3. Text: “@USAirways also, can you explain why, when I checked in, on the US Airways site, & picked ‘Standby for 1st’ I was not put on the list?”
   Analysis: The commenter asks why they were not placed on the standby list for first class despite selecting the option during check-in.
   Reason: The model may interpret “can you explain why” as dissatisfaction or complaint, classifying it as negative.
   Inference: Annotation inconsistency and human judgment differences. Customer inquiries can be interpreted as either “neutral” or “dissatisfaction,” leading to confusion during model training.

4. Text: “Thanks, @chasefoster. Was just about to book a flight to UK using @AmericanAir, but after reading this exchange, there’s no way. I’ve sent”
   Analysis: The commenter thanks someone and states they were planning to book a flight with @AmericanAir but decided against it after reading an exchange.
   Reason: The phrase “there’s no way” expresses strong rejection, leading the model to classify it as dissatisfaction with @AmericanAir.
   Inference: Ambiguity in sentiment and context dependency. “There’s no way” requires contextual judgment.

5. Text: “@AmericanAir this receipt doesn’t show the evoucher value nor does it mention having used an evoucher.”
   Analysis: The commenter points out that the receipt does not show the e-voucher value or mention its use.
   Reason: The model may interpret phrases like “doesn’t show” and “nor does it mention” as dissatisfaction or complaint, classifying it as negative.
   Inference: Lacks clear emotional markers. It simply “states a problem” without containing overt sentiment words.

6. Text: “@AmericanAir I am trying to switch my flight to AA 1359 I am currently on AA 2401 at 6:50 am MONDAY morn then AA2586! Help Me!!”
   Analysis: The commenter is trying to change their flight and is requesting assistance.
   Reason: The use of “Help Me!!” and multiple exclamation marks may lead the model to interpret the tone as anxiety or dissatisfaction, classifying it as negative.
   Inference: Ambiguity in sentiment and context dependency. The exclamation marks require context for accurate interpretation.
Table 8. Ablation study results on airline-tweet sentiment validation set.

| Model | Accuracy | Cohen’s κ | Negative Recall | Neutral Recall | Positive Recall |
|---|---|---|---|---|---|
| SFT (NLL only) | 78.67% | 0.680 | 72.00% | 94.00% | 70.00% |
| DPO | 77.33% | 0.660 | 98.00% | 68.00% | 66.00% |
| ORPO (full) | 86.00% | 0.790 | 90.00% | 88.00% | 80.00% |
