Article

Sentiment Analysis in Mexican Spanish: A Comparison Between Fine-Tuning and In-Context Learning with Large Language Models

by Tomás Bernal-Beltrán 1, Mario Andrés Paredes-Valverde 2, María del Pilar Salas-Zárate 2, José Antonio García-Díaz 1 and Rafael Valencia-García 1,*
1 Departamento de Informática y Sistemas, Universidad de Murcia, Campus de Espinardo, 30100 Murcia, Spain
2 Tecnológico Nacional de México, I.T.S. Teziutlán, Fracción I y II, Teziutlán 73960, Puebla, Mexico
* Author to whom correspondence should be addressed.
Future Internet 2025, 17(10), 445; https://doi.org/10.3390/fi17100445
Submission received: 29 July 2025 / Revised: 2 September 2025 / Accepted: 25 September 2025 / Published: 29 September 2025

Abstract

The proliferation of social media has made Sentiment Analysis an essential tool for understanding user opinions, particularly in underrepresented language variants such as Mexican Spanish. Recent advances in Large Language Models have made effective sentiment analysis possible through in-context learning techniques, reducing the need for supervised training. This study compares the performance of zero-shot and few-shot approaches with traditional fine-tuning on tourism-related texts in Mexican Spanish. Two annotated datasets from the REST-MEX 2022 and 2023 shared tasks were used for this purpose. Results show that fine-tuning, particularly with the MarIA model, achieves the best overall performance. However, modern LLMs that use in-context learning strategies, such as Mixtral 8x7B for zero-shot and Mistral 7B for few-shot, demonstrate strong potential in low-resource settings by closely approximating the accuracy of fine-tuned models, suggesting that in-context learning is a viable alternative to fine-tuning for sentiment analysis in Mexican Spanish when labeled data is limited. These approaches can enable intelligent, data-driven digital services with applications in tourism platforms and urban information systems that enhance user experience and trust in large-scale socio-technical ecosystems.

1. Introduction

Social media has become the main source of information for much of the global population, particularly younger users, surpassing traditional media for many people [1]. This shift towards digital consumption has democratized access to information, enabling users to instantly share their opinions and take part in social interactions on a global scale. However, this immediacy and broad reach have also facilitated the spread of hate speech, misinformation and social polarization. Previous studies have shown that negative and polarizing content spreads more quickly than truthful information on social media platforms [2]. Furthermore, the lack of effective moderation, coupled with the amplification of extremist and populist content driven by social media algorithms, contributes to the normalization of intolerant and extremist attitudes [3].
In this context, sentiment analysis (SA) of social media posts has become an essential tool for understanding and mitigating the negative consequences of online interactions. SA techniques can identify patterns of polarization, detect hate speech or misinformation and assess the emotional tone of shared content. In particular, SA using Natural Language Processing (NLP) techniques has become increasingly relevant in recent decades, in line with the growing use of social networks within the population. Since the early 2000s, SA has been one of the most active research areas in NLP. While the majority of significant scientific efforts have focused on the English language, research on SA in Spanish remains limited, particularly with regard to varieties of Spanish spoken outside of Spain.
Recent advances in NLP, particularly the introduction of the Transformer architecture [4], have revolutionized the field by enabling the development of Large Language Models (LLMs). These powerful models have billions of parameters and can learn complex patterns in language and context. They surpass the accuracy and efficiency of traditional NLP techniques, achieving state-of-the-art results in various downstream tasks such as SA, hate speech detection and topic classification. Furthermore, the in-context learning (ICL) capabilities of LLMs have made it possible to perform these tasks without specific training (fine-tuning), simply by providing clear instructions on how to solve the task as an input prompt.
Despite these advances, most existing LLMs are predominantly trained on English data, resulting in poor performance in other languages, including Spanish. This issue is further exacerbated by informal, regional and dialectal variations that are typical of social media. This underscores the importance of conducting targeted research into adapting and evaluating LLMs for SA in Spanish and its dialects. Such research is essential to ensure equitable access to the benefits of these technologies across linguistic communities. However, despite the growing popularity of LLMs, comparative studies between fine-tuning and ICL-based strategies in non-English and dialect-specific settings remain scarce. While multilingual models such as mBERT [5], XLM-RoBERTa [6], and LaBSE [7] have shown promising results across languages, their performance remains inconsistent for regional varieties of Spanish, including Mexican Spanish. Similarly, prior work on sentiment analysis in Latin American contexts [8,9] highlights the challenges posed by informal language and dialectal variation, but does not explicitly compare fine-tuning with ICL-based strategies. Foundational research on zero-shot (ZSL) and few-shot (FSL) learning with LLMs [10] has demonstrated the potential of ICL-based strategies; however, their effectiveness in underrepresented dialects remains underexplored. This justifies the relevance of our study, which systematically compares fine-tuning and ICL approaches for SA in Mexican Spanish.
This work is not only relevant in the field of NLP, but also contributes to a broader vision of next-generation internet ecosystems. Previous studies published in the Future Internet journal have demonstrated the importance of opinion mining for intelligent digital platforms. For instance, in [11], the authors employ character- and word-level features alongside attention mechanisms to enhance the effectiveness of microblog SA, while in [12], the authors propose a novel recommendation system that leverages ensemble learning by integrating SA of textual data with collaborative filtering techniques to offer users more precise and personalized recommendations. In this context, using LLMs for SA in Mexican Spanish tourism-related texts is key to enabling distributed intelligent systems to process massive streams of online opinions in real time. This improves the user experience and enables more natural human–machine interaction in multilingual contexts.
Furthermore, this work’s application in the tourism sector links directly to smart city services and intelligent information and communication systems, areas of strong interest to the Future Internet journal. One relevant example is [13], in which the authors demonstrate how context-aware SA in tourism can provide valuable insights for comprehensive destination monitoring and management by offering a detailed understanding of tourist experiences. Motivated by this, our approach enables the automatic understanding of citizens’ opinions, optimizing digital services, promoting the inclusion of underrepresented linguistic communities and bolstering trust in socio-technical platforms.
In this context, this study evaluates two datasets of tourism-related texts in Mexican Spanish (the variety of Spanish spoken in Mexico) annotated for SA as well as for complementary labels such as topic and location. The study aims to evaluate the effectiveness of ICL-based strategies for LLMs compared to fine-tuning strategies for the SA task. To this end, the following research questions were defined:
  • RQ1. How effective are ICL-based strategies for the SA task in Mexican Spanish tourism texts?
  • RQ2. How does the performance of ICL-based strategies compare with fine-tuning an LLM for this task?
  • RQ3. Which LLMs perform best when using ICL-based techniques for this task?
  • RQ4. Are there notable differences in the performance when applying different ICL-based techniques to this task?
The remainder of this paper is organized as follows: Section 2 reviews related work, focusing on recent developments in SA and the advancements in ICL capabilities. Section 3 details the datasets used and describes the experimental setup for both fine-tuning and ICL-based approaches. Section 4 reports the results of the experiments. Section 5 provides a more in-depth analysis and interpretation of the findings. Section 6 summarizes our findings and outlines possible directions for future research.

2. Background Information

Social media has become the main source of information for much of the global population, surpassing traditional media sources [1]. While this allows people to instantly share their opinions and take part in social interactions on a global scale, it also facilitates the spread of hate speech, misinformation and social polarization. Previous studies have shown that negative and polarizing content spreads more quickly than truthful information on social media platforms [2]. Furthermore, the recommendation algorithms used by social networks tend to amplify extremist and populist content, thereby normalizing harmful or intolerant ideologies among certain user groups as it generates greater interaction [3]. The combination of virality and algorithmic amplification raises concerns about online social polarization. This underlines the importance of developing tools to monitor and mitigate these negative effects.
In response to these challenges, SA (also known as opinion mining), has become essential for understanding the dynamics of online interactions and identifying problematic content. SA involves determining the attitude or emotion expressed in a text (e.g., positive, negative or neutral). It is essential for identifying patterns of polarization, hate speech and misinformation in people’s opinions [14]. SA has been used for years to analyze other types of opinionated content, such as online reviews and news articles. However, social media content poses unique challenges to NLP in general and SA in particular, such as a limited message length, the fact that some posts are based on the media associated with it, and the use of jargon, abbreviations and informal language, as discussed in [15].
Early approaches to SA relied on emotional content lexical resources, such as dictionaries of sentiment words and phrases expressing sentiment, along with their associated orientations and strengths. Examples of such resources can be found in [16,17]. These lexical resources were used to compute a sentiment score for a given text. Machine Learning (ML) algorithms were also employed in SA, where models were trained using text representations. These models were trained using either simple representations, such as unigrams (bags of words) [18] or lexical n-grams [19], or with more complex representations, such as those proposed in [20]. The author of the latter study proposed that using a large initial feature set obtained through abstract linguistic analysis, combined with feature reduction, improved the accuracy of ML techniques for SA in a very noisy domain such as customer feedback. While these approaches enabled significant progress, they had notable limitations. These methods struggled to capture the irony, context and the linguistic variations that are inherent in informal language on social networks.
In recent years, the adoption of deep learning (DL) techniques has enabled remarkable improvements in the performance of SA systems and in NLP in general [21]. Artificial Neural Networks (ANNs) were used in [22] for document-level SA, where the authors demonstrated that ANNs produced results comparable to those of Support Vector Machines (SVMs) in most cases. Convolutional Neural Networks (CNNs) have also been applied: in [23], CNNs were used to learn sentence representations by considering sentence relationships from word embeddings. Lastly, Recurrent Neural Networks (RNNs), Long Short-Term Memory Networks (LSTMs), and their slight variation, the Gated Recurrent Unit (GRU), were used in [23] to adaptively encode the semantics of sentences and their inherent relations in text for SA, alongside other DL techniques.
However, it was the introduction of the Transformer architecture [4] that marked the most significant shift. Large-scale pre-trained Transformer models, also known as Large Language Models (LLMs), transformed NLP by demonstrating that a large model trained on vast amounts of text can “understand” and represent language extremely effectively, transferring that knowledge to a variety of downstream tasks. In the field of SA, these pre-trained language models have achieved unprecedented levels of accuracy, outperforming traditional methods by a wide margin. For example, on corpora of Spanish tweets containing sexist comments from various editions of the EXIST shared task [24,25], BETO [26] and MarIA [27], two BERT-based [5] models pre-trained on Spanish data, performed well. They produced consistent results in both the 2021 [24] and 2022 [25] editions of the dataset, with MarIA achieving slightly higher accuracy in both cases. Both models achieved higher accuracies than those obtained by task participants who used classical and deep learning-based approaches at the time.
These LLMs’ ICL capability is a revolutionary aspect, allowing them to solve tasks for which they have not been specifically trained by providing clear instructions and, optionally, a list of examples of how to solve the task as the input prompt. This enables LLMs to adapt their behavior to specific tasks without the need for further retraining, thereby reducing computational costs and facilitating their use in languages and domains where annotated resources are limited. This ICL capability can be exploited using techniques such as ZSL [28], where the model receives explicit natural language instructions about the task to be solved (for example, the prompt could be “Classify this text as positive, negative, or neutral”) or FSL [29], where the model receives natural language instructions and some annotated examples within the same prompt. Several studies have demonstrated that LLMs can achieve competitive performance compared to models that have been specifically trained for a task when a well-designed prompt is used. This has been shown to work in languages other than English, such as Spanish and its regional variants. In some cases, LLMs can even improve with the inclusion of appropriate examples. For example, in [10], the authors evaluated the performance of different LLMs in the SA task in a cross-lingual scenario using ICL techniques. They demonstrated that an LLM trained in the English language with few input examples could rival classifiers specifically trained for the SA task.
In addition to full fine-tuning and ICL techniques, recent research has introduced a third level of strategies designed to make LLMs more efficient and align them with human preferences.
Methods such as adapters, prefix-tuning and Low-Rank Adaptation (LoRA) are examples of parameter-efficient fine-tuning (PEFT), which allows only a small fraction of the model parameters to be updated, significantly reducing computational costs while preserving performance in downstream tasks. For example, in [30], the authors successfully applied LoRA to update an LLM on an SA task, demonstrating that lightweight adaptation can match or even surpass traditional fine-tuning while enabling deployment with limited resources. In parallel, instruction tuning has emerged as a key paradigm in which models are trained on large collections of instruction–response pairs to improve their ability to generalize across tasks and languages. For instance, in [31], the authors applied instruction tuning to a general-purpose LLM using a small set of financial sentiment examples reformulated as instructions. The resulting model notably outperformed both state-of-the-art sentiment classifiers and widely used LLMs, such as ChatGPT and LLaMA, in scenarios requiring numerical reasoning and contextual understanding.
Reinforcement Learning from Human Feedback (RLHF) is also a powerful alignment technique that combines supervised fine-tuning with reinforcement learning guided by human evaluative signals. While its most notable applications are in conversational agents such as ChatGPT rather than in SA directly, recent efforts have begun to explore related concepts. In [32], the authors propose inferring reward functions from natural language feedback by decomposing human comments into aspects and sentiment polarities through aspect-based SA. These are then used as signals within an inverse reinforcement learning framework. More recently, Constitutional AI (CAI) has emerged as an alternative alignment paradigm, reducing reliance on costly human feedback by guiding model behavior through a set of predefined rules or principles. Its most visible applications have been in dialogue safety and ethical alignment, as demonstrated in [33], where the authors showed that explicit constitutions enable models to self-critique and revise outputs without the need for direct human reward signals. Although CAI has not yet been applied directly to SA, the paradigm suggests promising ways to integrate domain-specific rules, such as those relating to fairness in polarity detection or cultural sensitivity, into SA tasks.
Together, these approaches demonstrate the variety of adaptation strategies available to complement traditional fine-tuning and ICL methods. They also highlight potential avenues for advancing SA tasks in underrepresented languages, such as Mexican Spanish.

3. Materials and Methods

This section presents the experimental setup designed to address the research questions concerning the performance of various LLMs on the SA task. Two main strategies are considered in the evaluation: fine-tuning and ICL. The latter includes both ZSL and FSL approaches.
To facilitate the comparison between these paradigms, the section is structured as follows. Section 3.1 describes the datasets used in our experiments. Then, Section 3.2 details the experimental setup, which includes three evaluation strategies: fine-tuning of encoder-based LLMs (used as baselines) and ICL-based approaches (ZSL and FSL).
Although PEFT methods, instruction tuning, RLHF, and CAI represent significant progress in adapting and aligning LLMs, these approaches were not included in the present study. The primary objective of this study was to conduct a systematic comparison of traditional fine-tuning and ICL techniques for SA in Mexican Spanish tourism-related texts under controlled experimental conditions. Including additional strategies such as PEFT or instruction tuning would have required extensive hyperparameter exploration, diverse training regimes, and access to specialized instruction–response datasets, all of which were beyond the scope of this work. Similarly, while RLHF and Constitutional AI are promising for improving alignment, they demand substantial human resources, domain-specific constitutions, or reward modeling infrastructures that are not yet available for Mexican Spanish. For these reasons, we focused on the two most widely adopted paradigms in the current literature: full fine-tuning and ICL strategies. This ensures a clear and reproducible experimental framework.

3.1. Datasets

To evaluate the performance of the LLMs under study, we selected datasets specifically designed for SA in Mexican Spanish, in alignment with the research questions posed in this work. Specifically, we used datasets from the 2022 [8] and 2023 [9] editions of the REST-MEX shared task, which focus on analyzing user opinions in the tourism domain. These datasets are particularly valuable for Mexican Spanish research, given the limited availability of high-quality, annotated corpora for SA in this language variant. Furthermore, the selected datasets have been used in international workshops such as IberLEF, providing a shared evaluation framework that facilitates comparison between different approaches and promotes the development of language technologies for underrepresented varieties of Spanish.
We focused on a single task, n-ary emotion detection, since this is the specific task addressed in both REST-MEX datasets. This involves classifying the sentiment expressed in a given text using a five-point ordinal scale, where 1 represents the most negative sentiment and 5 the most positive. This approach allows us to assess the models’ ability to capture varying degrees of polarity in fine detail. It is important to note that the official gold labels for the test splits of both REST-MEX editions are not publicly available. Consequently, we reorganized the available data by partitioning the original training sets to create new test sets. Specifically, we randomly selected 20% of the original training data to use as a test set, and used the remaining 80% to train the models in the fine-tuning strategies. Consequently, the results reported in this study are not directly comparable to those published on the official leaderboards of the shared tasks. However, this adjustment ensures a fair and consistent evaluation framework across all models and experimental settings considered in our work.
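As an illustrative sketch of this repartitioning, the following Python snippet holds out 20% of the original training data with a fixed seed; the file name and data format below are assumptions made for illustration, not the official REST-MEX distribution format.

```python
# Minimal sketch of the 80/20 repartitioning described above (assumed file name).
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed for reproducibility

# Load the original REST-MEX training file (path and format assumed)
df = pd.read_csv("restmex_2022_train.csv")

# Hold out 20% of the original training data as the new test set;
# the remaining 80% is used only for the fine-tuning strategies
train_df, test_df = train_test_split(df, test_size=0.20, random_state=SEED)

print(len(train_df), len(test_df))  # 24,169 and 6043 instances for REST-MEX 2022
```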
The selected datasets are described below, and Table 1 provides a summary of their size and the data partitions used in our experiments.
  • REST-MEX 2022: This dataset was designed for the SA task on Mexican Spanish tourism-related texts. User reviews were extracted from TripAdvisor [8]. The training set comprises 30,212 reviews, each of which is annotated with two labels: the sentiment polarity (on an ordinal scale from 1 to 5) and the type of attraction being reviewed (hotel, restaurant or tourist site). As previously mentioned, since the official test splits are not publicly available, we partitioned the original training set, holding out 20% to serve as a test set. This resulted in 24,169 instances for training (used only for fine-tuning) and 6043 instances for testing.
  • REST-MEX 2023: This dataset builds on the previous edition by increasing the number of instances and including reviews from additional countries (Cuba and Colombia) [9]. The training set contains 251,702 reviews, each of which is annotated with three labels: sentiment polarity (1–5), the type of attraction (hotel, restaurant or tourist site) and the country of origin of the review (Mexico, Cuba or Colombia). Using the above partitioning, the dataset is divided into 201,361 instances for training (used solely for fine-tuning) and 50,341 instances for testing.
Table 1. Size and data partitions of the datasets used.

Dataset          Train      Test      Total
REST-MEX 2022    24,169     6043      30,212
REST-MEX 2023    201,361    50,341    251,702

3.2. Experimental Setup

This subsection describes the experimental setup designed to evaluate and compare the performance of fine-tuning and ICL-based strategies for SA in Mexican Spanish. This includes selecting appropriate models, defining evaluation strategies and designing prompts for the ICL-based approaches. To ensure a fair comparison, all experiments were conducted using the same dataset partitions and evaluation metrics.
The setup is structured into three parts. First, we present the encoder-based models used in the fine-tuning approach, which serve as baselines (see Section 3.2.1). Second, we describe the LLMs employed in the ICL-based experiments, including models from the Gemma, LLaMA, Mistral and Qwen families (see Section 3.2.2). Finally, we detail the prompts designed for both ZSL and FSL settings, which are key to leveraging the ICL capabilities of the selected LLMs (see Section 3.2.3).

3.2.1. Models Used in the Fine-Tuning Approach

To fairly compare the ZS and FS capabilities of generative LLMs with supervised fine-tuning approaches, we established a strong baseline by fine-tuning a diverse set of popular encoder-based models pre-trained specifically for the Spanish language. Due to their linguistic alignment and relatively low computational cost compared to large decoder-based LLMs, these models serve as strong baselines for the SA task in Mexican Spanish. They vary in architecture (BERT, RoBERTa) and pre-training objectives, as well as in optimisation strategies such as distillation and parameter sharing. This allows us to evaluate how different design choices affect performance on the SA task in Mexican Spanish. The models selected to evaluate the fine-tuning approach are the following:
  • MarIA [27]: This is a RoBERTa-based model that has been pre-trained specifically for the Spanish language. It was trained using the largest known Spanish corpus to date: 570 GB of clean, de-duplicated text, compiled from web crawls conducted by the Biblioteca Nacional de España between 2009 and 2019. MarIA is one of the most robust large-scale monolingual models available for standard Spanish.
  • BETO (cased and uncased) [26]: Based on the original BERT architecture and pre-trained exclusively on Spanish corpora, BETO is a widely used baseline in Spanish NLP. Both cased and uncased variants were evaluated to assess the impact of case sensitivity, particularly in informal and user-generated content.
  • BERTIN [34]: This is a lightweight RoBERTa-base model pre-trained specifically for the Spanish language and designed to strike a balance between performance and computational cost. It was trained on the Spanish portion of the mC4 corpus, a large-scale multilingual dataset derived from Common Crawl.
  • ALBETO [35]: This is a compressed variant of BETO, inspired by the ALBERT architecture and pre-trained specifically for the Spanish language. ALBETO employs parameter sharing and factorized embeddings to reduce model size while maintaining semantic capabilities.
  • DistilBETO [35]: This is a distilled version of BETO based on the DistilBERT architecture, pre-trained specifically for Spanish. It combines the efficiency and reduced size of DistilBERT with the language-specific advantages of BETO.
To adapt each encoder-based model for the SA task, we applied a standard fine-tuning procedure with labeled data from the REST-MEX datasets. To ensure a fair comparison across all models and isolate the impact of differences in model architecture and pre-training, we used the same set of hyperparameters for every fine-tuning run. Additionally, we did not perform a hyperparameter search, as our objective was to evaluate the relative performance of the models under equal training conditions, rather than to find the optimal configuration for each individual model. The hyperparameters used were 15 training epochs, a batch size of 16, a learning rate of 3 × 10⁻⁵, and a weight decay of 0.01.
All fine-tuning experiments were executed with the Hugging Face Transformers library (v. 4.51.3) and PyTorch (v. 2.6.0) using the Trainer API. The AdamW optimiser was employed together with a linear learning rate scheduler, and early stopping, checkpoint-based model selection and gradient accumulation were not used, ensuring that every model was trained for exactly 15 epochs under identical conditions. To ensure determinism, we fixed the random seed for Python (v. 3.10.16), NumPy (v. 2.2.5) and PyTorch (v. 2.6.0) and used an 80/20 training/validation split on the training data. Text inputs were minimally preprocessed by concatenating the title and the opinion/review fields into a single sequence (Text = Title + Opinion/Review), while respecting diacritics and the prescribed casing for each model variant (e.g., cased vs. uncased BETO). Tokenization was handled using each model’s native tokenizer with truncation and padding to a maximum sequence length of 128 tokens to ensure consistency between training and inference.
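The following sketch outlines this fine-tuning setup with the Hugging Face Trainer API using the hyperparameters reported above; the checkpoint identifier (assumed here for MarIA) and the column names are illustrative assumptions rather than the exact training script used in this study.

```python
# Minimal sketch of the fine-tuning procedure; column names ("title",
# "opinion", "polarity"), file path and checkpoint identifier are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments, set_seed)

set_seed(42)  # fixes the Python, NumPy and PyTorch seeds

full_df = pd.read_csv("restmex_2022_train.csv")  # assumed path (see Section 3.1)
train_df, valid_df = train_test_split(full_df, test_size=0.20, random_state=42)

MODEL_ID = "PlanTL-GOB-ES/roberta-base-bne"  # assumed checkpoint for MarIA
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=5)

def to_features(batch):
    # Text = Title + Opinion/Review, truncated/padded to 128 tokens
    texts = [f"{t} {o}" for t, o in zip(batch["title"], batch["opinion"])]
    enc = tokenizer(texts, truncation=True, padding="max_length", max_length=128)
    enc["labels"] = [int(p) - 1 for p in batch["polarity"]]  # map classes 1-5 to 0-4
    return enc

train_ds = Dataset.from_pandas(train_df).map(to_features, batched=True)
valid_ds = Dataset.from_pandas(valid_df).map(to_features, batched=True)

args = TrainingArguments(
    output_dir="finetuning-output",
    num_train_epochs=15,
    per_device_train_batch_size=16,
    learning_rate=3e-5,
    weight_decay=0.01,
)

Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=valid_ds).train()
```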

3.2.2. Models Used in the ICL-Based Approaches

In order to evaluate the performance of ICL strategies for SA in Mexican Spanish, including both ZSL and FSL, we selected a representative set of open-weight, decoder-based LLMs with instruction-following capabilities. The aim was to explore how different design choices influence ICL performance under a unified prompting framework by covering a range of model families, architectures and parameter sizes.
Rather than identifying the best-performing individual model, we aimed to conduct a fair comparison of recent LLMs with diverse architectures, training corpora and scales, all of which were evaluated under the same prompting conditions. To this end, we deliberately included models with varying numbers of parameters to examine the effect of model size on ZSL and FSL capabilities. It is important to note that all models were evaluated without any fine-tuning or parameter updates, relying solely on prompt-based instructions, as described in Section 3.2.3.
The selected model families are as follows:
  • Google’s Gemma family: We evaluated the following models from this family: Gemma 2 2B and Gemma 2 9B (both from version 2), and Gemma 3 1B. These are all instruction-tuned, decoder-only models, optimized for efficiency and multilingual usage. While the Gemma 2 models prioritize efficiency and accessibility, the Gemma 3 model introduces architectural improvements inherited from Gemini, thereby improving contextual understanding and cross-lingual generalization. Despite its small size, Gemma 3 1B achieves strong FS performance, making it a competitive option for resource-constrained inference.
  • Meta’s LLaMA family: We evaluated LLaMA 2 7B, LLaMA 3.1 8B and LLaMA 3.2 3B, which represent successive generations of Meta’s open-weight LLMs. LLaMA 2 was Meta’s first major release of instruction-tuned models under an open-weight license. While it provided solid performance in English, its multilingual robustness was limited. LLaMA 3.1 significantly improved in this regard, enhancing multilingual understanding and instruction tuning. The most recent release, LLaMA 3.2, optimizes performance in smaller models through architectural and tokenisation improvements, making it ideal for FS tasks with lower resource demands.
  • Alibaba’s Qwen family: We evaluated two multilingual instruction-tuned models with a focus on cross-lingual and multi-task generalization: Qwen 2.5 7B and Qwen 3 8B. Qwen 2.5 was already competitive in cross-lingual tasks and Qwen 3 improves upon this by further enhancing instruction-following and generalization in non-English contexts. Their tokenizer and pretraining data make them particularly suited to tasks involving informal or dialectal Spanish.
  • Mistral AI’s Mistral family: We evaluated two models: Mistral 7B, a dense model; and Mixtral 8x7B, a Mixture-of-Experts (MoE) model. Mistral models are designed for efficiency and strong performance in FS settings, and Mixtral combines scalability with reduced inference cost by activating only a subset of its expert layers per input, making it a promising choice for ICL under realistic constraints. Due to GPU limitations, we applied 4-bit quantization via the BitsAndBytes library to reduce memory requirements for the Mixtral 8x7B model, while all other models were loaded in full precision.
The selected models were chosen for the following reasons: (1) all are publicly available under open-source licenses, making them suitable for reproducible research; (2) they have demonstrated strong instruction-following and ICL capabilities in prior multilingual benchmarks; and (3) their diverse design allows us to explore the influence of scale, architecture and training data coverage on ICL effectiveness in Mexican Spanish.
All ZS and FS experiments were carried out using the Hugging Face Transformers library and PyTorch with the AutoModelForCausalLM interface. To ensure comparability and determinism, we fixed the random seed for Python, NumPy and PyTorch. We also employed an 80/20 training/validation split of the datasets, using only the validation portion for evaluation, since no training was performed. Text inputs were minimally preprocessed by concatenating the title and opinion/review fields into a single sequence (Text = Title + Opinion/Review), while preserving diacritics and casing. In ZS, the prompt consisted solely of task instructions followed by the input text. In FS, the same set of five examples was randomly sampled once from the training split and kept fixed across all models and runs to avoid variability due to prompt construction. For each instance, the prompts were tokenized using the model’s native tokenizer and fed into the model’s chat template. The system/user/assistant roles were adapted depending on the model family (e.g., LLaMA, Gemma, Qwen, Mistral). Inference was performed with greedy decoding, using a maximum of 32 new tokens and without updating model parameters.
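As an illustration, the following sketch outlines the core of this inference loop with the Hugging Face AutoModelForCausalLM interface, assuming a checkpoint whose chat template supports system and user roles; the model identifier is illustrative, the commented-out line indicates how 4-bit loading would be enabled for Mixtral 8x7B, and the exact prompt contents are described in Section 3.2.3.

```python
# Minimal sketch of the prompt-based (ZS/FS) inference loop described above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed

set_seed(42)  # fixes the Python, NumPy and PyTorch seeds

MODEL_ID = "Qwen/Qwen2.5-7B-Instruct"  # illustrative checkpoint with system-role support
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
    # quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # used for Mixtral 8x7B
)

def generate_label(system_prompt: str, user_prompt: str) -> str:
    """Run one inference step with greedy decoding and at most 32 new tokens."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},  # ZSL/FSL instructions (Section 3.2.3)
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=32, do_sample=False)
    # Decode only the newly generated tokens; no model parameters are updated
    return tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
```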

3.2.3. Prompts Used for ZSL and FSL Approaches

In the context of LLMs, a prompt is the input text that guides the model’s behavior for a given task. This can be a sentence, phrase or even an entire paragraph, and usually includes a task description and contextual cues that define the expected outcome. In FSL, it also includes a set of examples to elicit the desired output from the model without updating its parameters. In the context of ICL, designing the input prompt is critical, as it directly impacts model performance, since model outputs are entirely determined by the information contained within the prompt.
For both ZSL and FSL in this evaluation, we carefully designed task-specific prompts to elicit accurate SA responses from the evaluated models. These prompts were designed to be concise and interpretable, and to adapt to the nature of the user-generated content present in the REST-MEX datasets.
As some of the models support a structured system/user/assistant role format, two distinct prompts were designed, adapted to the capabilities of the different models. Specifically, we distinguished between models that support these structured dialogue roles and those that do not. This enabled us to leverage system-level instruction conditioning where applicable.
For models with system-role support, the prompt was structured as follows: (1) a system message providing high-level behavioral guidance: You are a classification model that is really good at following instructions. Please follow the user’s instructions as precisely as possible.; and (2) a user message containing task-specific instructions. In contrast, for models without system-role support, the prompt was presented as a single plain text block containing these instructions.
In the ZSL setting, the prompt only included task-specific instructions, which were followed directly by the test input. The ZSL prompt was as follows (Listing 1):
Listing 1. ZSL Prompt.
Your task will be to classify a text that contains an opinion issued by a tourist. Each opinion’s class is an integer between [1, 5], where 1 represents the most negative polarity and 5 the most positive.
Please respond only with a single label that you think fits the text best.
Classify the following piece of text: {{Text}}
In the FSL setting, the prompt (Listing 2) included the same task-specific instructions, followed by five randomly selected examples from the training set, that is, examples that will not appear in the inference (test) set. These examples were fixed for all models and all experiments to ensure a fair comparison of different architectures and sizes. The FSL prompt was as follows:
Listing 2. FSL Prompt.
Your task will be to classify a text that contains an opinion issued by a tourist. Each opinion’s class is an integer between [1, 5], where 1 represents the most negative polarity and 5 the most positive.
Please respond only with a single label that you think fits the text best.
Here are some examples given by experts:
Text: {{Example Text}}
Label: {{Example Label}}
Classify the following piece of text: {{Text}}
In both cases, the task-specific instructions remained the same. This prompt strategy enabled a controlled comparison of models with different capabilities, leveraging instruction-following alignment where available. To standardize the outputs, the responses generated by the models were post-processed using regular expressions to extract only the numerical label corresponding to the predicted sentiment class. Despite this step, there were occasional instances where the model’s response was unrelated to the prompt or did not contain a valid label. In such cases, the output was replaced with the most frequent label in the dataset, that is, the label 5 for both REST-MEX datasets.
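To make the prompt handling concrete, the sketch below shows, under simplifying assumptions, how the fixed few-shot examples can be formatted into the example block of Listing 2 and how the raw model responses are post-processed with a regular expression, falling back to the majority label (5) when no valid label is found. The helper names are ours and serve only as illustration.

```python
# Minimal sketch of FSL example formatting and output post-processing.
import re

MAJORITY_LABEL = 5  # most frequent class in both REST-MEX datasets

def build_fsl_examples(examples):
    """Format the five fixed (text, label) pairs into the example block of Listing 2."""
    lines = ["Here are some examples given by experts:"]
    for text, label in examples:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
    return "\n".join(lines)

def extract_label(response: str) -> int:
    """Return the first digit in [1, 5] found in the response, or the fallback label."""
    match = re.search(r"[1-5]", response)
    return int(match.group()) if match else MAJORITY_LABEL

# Example usage
print(extract_label("Label: 4 (great experience)"))   # -> 4
print(extract_label("I cannot classify this text."))  # -> 5 (fallback)
```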
All experiments were conducted on a dedicated server equipped with four NVIDIA GeForce RTX 4090 GPUs (24 GB of memory each), an AMD EPYC 7313 CPU, and 1 TB of RAM. The server ran Ubuntu 22.04 and CUDA 12.4. This configuration ensured sufficient resources for both fine-tuning encoder-based models and evaluating large decoder-based LLMs. Training times and memory usage were consistent across runs, and 4-bit quantization was employed when necessary (e.g., for Mixtral 8x7B) to enable efficient inference. To further support reproducibility, the source code and configuration files are available in a public repository (see the Data Availability Statement).

4. Results

This section presents the results of our comparative study of fine-tuning and ICL-based strategies. The latter encompasses both ZSL and FSL approaches. The aim is to evaluate how well each learning paradigm adapts to SA, with a focus on the polarity detection task in Mexican Spanish, using a unified evaluation framework.
Due to the ordinal nature of the polarity detection task, all models are evaluated across five polarity classes (from 1 to 5, with 1 being the most negative and 5 the most positive). Models are ranked using macro-averaged F1 scores. We also report accuracy and the Mean Absolute Error (MAE), which reflect general classification and ordinal performance, respectively, as well as macro-averaged precision and recall, which assess performance across all classes without favoring more frequent labels.
Additionally, we perform an error analysis of the misclassifications made by the highest-performing models under each approach (fine-tuning, ZSL and FSL) using confusion matrices. The severity of an error differs depending on whether the model labels a review as five points when it is actually one point, or misclassifies it only by a single point (e.g., as four points).
Despite the class imbalance in the datasets, no rebalancing techniques were applied during training or evaluation to maintain comparability across models and learning settings within a unified evaluation framework.
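For reference, the metrics described above can be computed with scikit-learn as in the following sketch, where y_true and y_pred denote the gold and predicted polarity labels (integers from 1 to 5); this is an illustrative outline rather than the exact evaluation script used in our experiments.

```python
# Minimal sketch of the evaluation: accuracy, macro precision/recall/F1,
# MAE over the ordinal labels, and a confusion matrix for the error analysis.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             mean_absolute_error,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    mae = mean_absolute_error(y_true, y_pred)  # treats labels as ordinal values
    cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5])
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "f1": f1, "mae": mae, "confusion_matrix": cm}
```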
To facilitate a structured and coherent analysis, the evaluation is organized by dataset. Section 4.1 focuses on the experiments conducted with the REST-MEX 2022 dataset, while Section 4.2 addresses the results obtained using the 2023 version.

4.1. REST-MEX 2022 Results

This section presents an analysis of the results obtained from experiments conducted on the REST-MEX 2022 dataset. Table 2 presents the overall performance of all the evaluated models using the fine-tuning, zero- and few-shot strategies on the REST-MEX 2022 dataset.
As shown in Table 2, the fine-tuning approach clearly outperforms ZSL and FSL across all evaluated metrics. Specifically, the MarIA model consistently outperforms others, achieving the best results across most metrics and obtaining the highest accuracy (76.89%), the lowest MAE (0.2521) and the highest macro-averaged F1-score (57.47%). ZSL follows closely behind: Mixtral 8x7B achieves an accuracy of 76.03%, an MAE of 0.2713 and an F1-score of 55.41%. Meanwhile, FSL, led by Mistral 7B, achieves an accuracy of 75.87%, an MAE of 0.2742 and an F1-score of 50.27%. These results suggest that fine-tuning provides the best outcomes when labeled data is available, and that modern LLMs can achieve similar results in a ZS setting without supervision. FS setups also offer a practical compromise when only a few examples are provided.
Furthermore, the greater variability in the performance of ZSL and FSL strategies compared to the consistent results obtained with fine-tuned models is notable. Specifically, when analyzing the macro-averaged F1-score of the models evaluated under the ZSL approach, the average F1-score obtained by the models is 37.284, with a standard deviation of 11.407, indicating considerable variation in model performance. In contrast, FSL models show a slightly lower average F1-score of 37.017 and an even smaller standard deviation of 9.564, suggesting that FS setups are less sensitive to the model architecture and size. In comparison, the fine-tuning approach exhibits a higher average performance and lower dispersion, highlighting its reliability when sufficient labeled data is available. These results emphasize that although ZSL and FSL can yield competitive results under ideal conditions, their performance is more inconsistent, making careful model selection crucial in low-resource or prompt-based settings.
When the results produced by the ZSL and FSL approaches are compared using the same models, it is evident that FSL does not consistently outperform ZSL. In fact, while some models benefit from the inclusion of examples in the prompt, others show marginal gains or even drops in performance. For example, Mistral 7B shows the greatest improvement, raising its F1-score from 30.20 (ZSL) to 50.27 (FSL), a remarkable gain of +20.07 points, confirming its strong ability to use FS examples effectively. Conversely, Mixtral 8x7B's performance worsens, dropping from 55.41 to 48.16, a decrease of 7.25 points, suggesting that its ZSL configuration is already near optimal and that additional examples could introduce redundancy or noise. Similarly, Gemma 2 9B, LLaMA 3.1 8B and Qwen 2.5 7B show significant decreases in F1-score when transitioning from ZSL to FSL (−11.03, −5.86 and −7.5 points, respectively), suggesting that these models may be more susceptible to prompt structure or overfitting to FS examples. Overall, while FSL often improves performance, particularly for models with strong instruction-following capabilities, this is not guaranteed and should be considered on a model-by-model basis.
Next, Table 3 presents the class-wise performance of all the models that were evaluated using the REST-MEX 2022 dataset.
As shown in Table 3, there is an asymmetry in the detection of sentiment extremes across all three approaches: highly positive opinions (class 5) are more easily identified than very negative ones (class 1). Under fine-tuning, MarIA achieves F1-scores of 88.22% and 57.45% for classes 5 and 1, respectively. In ZSL, models such as Mistral 7B and Mixtral 8x7B achieve F1-scores of over 87% for class 5, but their performance for class 1 remains between 52% and 56%. Similarly, in FSL, Mistral 7B achieves F1-scores of 88.18% and of 58.23% for class 5 and 1, respectively. These results suggest that all methods are more effective at identifying strongly positive sentiments, whereas detecting highly negative ones still requires additional data or illustrative examples to achieve comparable accuracy. Conversely, intermediate sentiment classes (classes 2, 3 and 4) demonstrate the poorest performance across all approaches, likely due to their semantic ambiguity. In the fine-tuning setting, MarIA achieves F1-scores of 50.58% and 40.61% for classes 3 and 2, respectively. ZSL results are similar, with Mixtral 8x7B achieving F1-scores of 50.00% and 39.38% for classes 3 and 2, respectively. In the FSL scenario, Mistral 7B obtains F1-scores of 47.31% and 25.00% for classes 3 and 2, respectively. These findings confirm that subtle sentiment distinctions remain a key challenge, even when examples or prior fine-tuning are available.
In conclusion, fine-tuning with MarIA is the most effective strategy for performing SA on the REST-MEX 2022 dataset when sufficient labeled data is available, delivering the best results in terms of accuracy, MAE and macro F1-score. However, in low-resource settings, the ZSL approach led by Mixtral 8x7B and the FSL approach led by Mistral 7B are viable alternatives. ZSL nearly matches the performance of fine-tuning without requiring any supervision, and FSL substantially improves upon base models with just a few labeled samples. Ultimately, the choice of method depends on the annotation budget and the trade-off between achieving maximum accuracy and minimizing labeling time and cost.
To gain deeper insights into the classification behavior of each model beyond aggregate metrics such as the F1-score, we analyzed the confusion matrices of the best-performing models under each approach: fine-tuning, ZSL and FSL. These matrices offer a detailed view of how each sentiment class is predicted, revealing patterns of confusion between categories that may be obscured by overall performance metrics. Figure 1 presents the confusion matrices of all strategies.
As shown in Figure 1a, the MarIA model performed strongly across sentiment categories, highlighting its adaptability to the task. Class 1 (very negative sentiment) showed significant overlap with class 2 (negative sentiment), suggesting difficulty in distinguishing between these closely related negative categories. Similarly, class 2 was often misclassified as class 3 (neutral), reflecting difficulties in identifying mildly negative sentiments. Class 3 (neutral) achieved moderate accuracy, but was often misclassified as positive (classes 4 and 5), revealing uncertainty at the neutral-positive boundary. Class 4 (positive sentiment) was generally well classified, though there was considerable misclassification into class 5 (highly positive sentiment), indicating difficulty in differentiating fine-grained positive sentiments. In contrast, class 5 was identified with high precision, demonstrating the model’s effectiveness in recognizing strongly positive sentiment.
As shown in Figure 1b, the Mixtral 8x7B model produced reliable classifications of extreme sentiments. Class 1 (very negative sentiment) exhibited significant overlap with class 2 (negative sentiment), suggesting challenges in distinguishing between closely related negative categories. Class 2 was frequently misclassified, especially into class 1 and class 3 (neutral), suggesting difficulty in identifying mildly negative sentiment without task-specific supervision. Class 3 proved particularly challenging, with widespread misclassifications into both negative and positive classes, suggesting that the model struggled to capture neutral sentiment accurately. Class 4 (positive sentiment) was often confused with class 5 (highly positive sentiment), while class 5 achieved excellent accuracy, reflecting the model’s ability to detect strongly positive sentiment even in a ZS setting.
As shown in Figure 1c, the Mistral 7B model exhibited a distinct pattern of strengths and weaknesses. Class 1 (very negative sentiment) exhibited improved accuracy compared to the ZSL approach, though confusion with class 2 (negative sentiment) persisted. Class 2 remained problematic, frequently being misclassified as either class 1 or class 3 (neutral), highlighting the difficulty in identifying mildly negative sentiment with limited examples. Class 3 was predicted fairly accurately, but there was high confusion with the positive classes (classes 4 and 5), reflecting the ongoing challenge of differentiating neutral sentiments from positive ones. Class 4 (positive sentiment) was often confused with class 5 (highly positive sentiment), indicating difficulty in distinguishing nuanced positive expressions. Class 5 achieved high precision, with only minor misclassifications into class 4, which reinforces the model’s strength in recognizing strong positive sentiment.
In summary, the confusion matrices confirm previous findings: all models perform well in identifying highly positive sentiment (class 5), but struggle with mildly negative (class 2) and neutral (class 3) classes. A common issue across all approaches is accurately discriminating between adjacent classes, especially when transitioning from negative to neutral or neutral to positive sentiments. While the fine-tuned MarIA model provides the most balanced and accurate overall predictions, both Mixtral 8x7B (ZSL) and Mistral 7B (FSL) effectively classify sentiment extremes with minimal supervision. These results highlight the need for more sophisticated modelling strategies or more extensive training data to enhance the classification of ambiguous or subtle sentiment expressions.

4.2. REST-MEX 2023 Results

This section presents an analysis of the results obtained from experiments conducted on the REST-MEX 2023 dataset. Table 4 presents the overall performance of all the evaluated models using the fine-tuning, zero- and few-shot strategies on the REST-MEX 2023 dataset.
As shown in Table 4, the fine-tuning approach once again achieves the best overall performance on the REST-MEX 2023 dataset. In this edition, the MarIA model leads again with an accuracy of 71.52%, an MAE of 0.3048 and a macro F1-score of 59.37%, which is a slight improvement on the macro F1-score obtained in the previous edition. The ZSL strategy offers the second-best performance, with Mixtral 8x7B achieving an accuracy of 70.83%, an MAE of 0.3229 and an F1-score of 55.96%. The FSL approach also produces solid results: Mistral 7B achieves an accuracy of 75.87%, an MAE of 0.2742 and an F1-score of 50.27%. These results confirm that, when labeled data is available, fine-tuning remains the most effective strategy. However, they also demonstrate that modern LLMs can achieve similar performance without direct supervision. Furthermore, they demonstrate that FSL is an effective intermediate solution when only a few annotated examples are available.
As observed in the REST-MEX 2022 experiments, the performance of ZS and FS strategies on the REST-MEX 2023 dataset exhibits greater variability when compared to fine-tuning. Specifically, the macro-averaged F1-scores show that the ZSL models have a mean of 34.69 and a standard deviation of 11.917, while the FSL models have a slightly higher average F1-score of 36.5, with a comparable standard deviation of 9.075. This high dispersion suggests that both paradigms are more sensitive to characteristics such as architecture, training data, and alignment with the instruction-following objective. Unlike the consistent and balanced performance of fine-tuned models, led again by MarIA, these results highlight the importance of careful model selection when working in low-resource scenarios or relying solely on prompt-based learning.
A model-by-model comparison of ZSL and FSL performances reveals that the benefits of providing FSL examples are not uniform across all LLMs. In some cases, FSL significantly outperforms ZSL; in others, however, it fails to yield better results. The Mistral 7B model stands out once again, with its F1-score jumping from 23.74 in ZSL to 50.27 in FSL, a substantial improvement of +26.53 points, demonstrating the model’s strong ability to benefit from in-context examples. A smaller gain is seen with Gemma 3 1B, which increases from 12.80 to 19.63 (+6.83 points). However, not all models benefit from this transition. Mixtral 8x7B, which performed strongly under ZSL (55.96), shows a performance drop to 42.99 in FSL (−12.97 points), suggesting that its instruction tuning already sufficiently captures the task. A similar but smaller decrease is seen with LLaMA 2 7B, which falls from 42.42 to 38.10 (−4.32 points). These findings confirm that FSL does not consistently outperform ZSL, highlighting the need for model-specific evaluation when selecting an inference strategy for practical applications.
Table 5 presents the class-wise performance of all the models that were evaluated using the REST-MEX 2023 dataset.
As shown in Table 5, there is still an asymmetry in the detection of polarity extremes: highly positive opinions (class 5) are easier to identify than very negative ones (class 1). Using the fine-tuning approach, MarIA achieves F1-scores of 83.29% and 60.70% for classes 5 and 1, respectively. In ZSL, Mixtral 8x7B obtains F1-scores of 84.01% and 60.35% for classes 5 and 1, respectively. Mistral 7B achieves similar values in the case of FSL: 88.18% for class 5 and 58.23% for class 1. These results suggest that while the models perform well with highly positive sentiments, the detection of extremely negative sentiments requires additional data or illustrative examples to achieve comparable accuracy.
Conversely, the intermediate classes (2, 3 and 4) remain the most challenging for all approaches, with the poorest model performance observed in these classes. In the case of fine-tuning, MarIA achieves F1-scores of 54.98% and 46.22% for classes 3 and 2, respectively. In ZSL, Mixtral 8x7B achieves F1-scores of 49.56% and 40.42% for classes 3 and 2, respectively. Using the FSL approach, Mixtral 8x7B achieves F1-scores of 46.43% and 31.16% for classes 3 and 2, respectively. These results confirm the difficulty of capturing subtle nuances of polarity, even when examples are provided or prior fine-tuning is performed. The semantic ambiguity and conceptual proximity between these classes make their precise separation challenging.
In conclusion, when sufficient labeled data is available, fine-tuning with MarIA is the most effective strategy for performing SA on the REST-MEX 2023 dataset, delivering optimal accuracy, MAE and macro F1-score. However, in contexts with limited resources, the ZSL approach led by Mixtral 8x7B and the FSL approach led by Mistral family models such as Mixtral 8x7B and Mistral 7B are both viable alternatives. ZSL produces results that are remarkably close to those of fine-tuning without supervision, while FSL substantially improves on baseline results using just a few annotated examples. The most appropriate strategy will depend on the annotation budget and the desired balance between accuracy and labeling cost.
As in the previous section, to gain deeper insights into the classification behavior of each model beyond aggregate metrics such as the F1-score, we analyzed the confusion matrices of the best-performing models under each approach: fine-tuning, ZSL and FSL. These matrices provide a detailed view of sentiment prediction patterns, enabling us to identify where the models excel or struggle in distinguishing between sentiment categories. Figure 2 presents the confusion matrices of all strategies.
As shown in Figure 2a, the MarIA model performed strongly across sentiment classes, reaffirming its adaptability to the classification task. Class 1 (very negative sentiment) was correctly predicted in most cases, although there was some confusion with class 2 (negative sentiment) and class 3 (neutral), indicating an ongoing difficulty in distinguishing closely related negative and neutral sentiments. Class 2 achieved solid accuracy, but showed substantial confusion with class 3, and to a lesser extent with classes 1 and 4. This indicates persistent ambiguity in capturing mildly negative sentiment. Class 3 was generally recognized, although frequent misclassifications into class 4 (positive sentiment) highlighted the difficulty of clearly distinguishing neutral from slightly positive sentiment. Class 4 was robustly classified; however, confusion with class 5 (highly positive sentiment) indicated difficulty in distinguishing between fine-grained positive classes. Class 5 was identified with high accuracy and minimal confusion, demonstrating the model’s effectiveness in recognizing strong positive sentiment.
As shown in Figure 2b, the Mixtral 8x7B model performed robustly in classifying extreme sentiments. While class 1 (very negative sentiment) was mostly classified correctly, there was noticeable confusion with class 2 (negative sentiment), reflecting difficulty in distinguishing between adjacent negative classes. Class 2 showed significant confusion with neighboring classes, particularly class 1 and class 3 (neutral), highlighting the difficulty of capturing mild negativity without fine-tuning. Class 3 remained one of the most problematic classes, frequently being misclassified as classes 2 and 4 (positive sentiment), revealing consistent ambiguity in sentiment neutrality. Class 4 showed strong classification performance, but was often misclassified as class 5 (highly positive sentiment), indicating difficulty in distinguishing subtle positive expressions. Class 5 achieved excellent accuracy with minimal confusion, highlighting the model’s ability to reliably recognize highly positive sentiment even without supervision.
As shown in Figure 2c, the Mistral 7B model exhibited a distinct pattern of strengths and weaknesses. Class 1 (very negative sentiment) demonstrated enhanced accuracy compared to the ZSL approach, with reduced confusion; however, minor misclassifications into class 2 (negative sentiment) persisted. Class 2 continued to pose difficulties, with frequent misclassification into classes 1 and 3 (neutral), indicating challenges in distinguishing mildly negative sentiment with limited supervision. Class 3 exhibited moderate accuracy, but was frequently misclassified as classes 2 and 4 (positive sentiment), suggesting difficulties in distinguishing neutrality from mildly positive or negative sentiment. Class 4 performed strongly but was still frequently confused with class 5 (highly positive sentiment), indicating limited resolution between adjacent positive classes. Class 5 was classified accurately, with only minor confusion with class 4, confirming the model’s ability to recognize highly positive sentiment effectively.
In summary, analyzing the confusion matrices for the REST-MEX 2023 dataset confirms and extends the patterns observed in the 2022 evaluation. Across all approaches, extreme sentiment classes, particularly those indicating highly positive sentiment (class 5), are identified with high accuracy. However, intermediate classes, particularly the mildly negative (class 2) and neutral (class 3) classes, remain more challenging due to their subtle distinctions. The fine-tuned MarIA model continues to deliver the most balanced classification, demonstrating robust performance across the sentiment spectrum, although it exhibits some residual confusion between adjacent classes. Mixtral 8x7B (ZSL) recognizes sentiment extremes well but struggles with neutral and mildly negative sentiments, highlighting the limitations of unsupervised settings. Mistral 7B (FSL) strikes a balance, improving performance over ZSL, particularly in class 1, while maintaining high accuracy in class 5; however, it still struggles with the classification of classes 2 through 4. Overall, these findings highlight the enduring challenge of modeling ambiguous or intermediate sentiments, pointing to the need for enhanced contextual understanding or targeted supervision to better resolve sentiment boundaries.
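To make the error analysis above easier to reproduce, the following sketch (an illustrative example rather than the released evaluation code; the gold labels and predictions shown are hypothetical) computes the confusion matrix, per-class F1-scores, macro F1 and MAE used throughout this section with scikit-learn.

```python
from sklearn.metrics import confusion_matrix, f1_score, mean_absolute_error

LABELS = [1, 2, 3, 4, 5]  # polarity ratings, from very negative (1) to very positive (5)

def evaluate(y_true, y_pred):
    """Confusion matrix plus the per-class and aggregate metrics reported in this study."""
    cm = confusion_matrix(y_true, y_pred, labels=LABELS)
    per_class_f1 = f1_score(y_true, y_pred, labels=LABELS, average=None)
    macro_f1 = f1_score(y_true, y_pred, labels=LABELS, average="macro")
    mae = mean_absolute_error(y_true, y_pred)  # penalizes predictions far from the true rating
    return cm, per_class_f1, macro_f1, mae

# Toy usage with hypothetical gold labels and predictions
y_true = [5, 4, 3, 2, 1, 5, 4, 3]
y_pred = [5, 5, 4, 3, 1, 5, 4, 2]
cm, per_class_f1, macro_f1, mae = evaluate(y_true, y_pred)
print(cm)
print({f"F1 class {c}": round(float(f), 2) for c, f in zip(LABELS, per_class_f1)})
print(f"Macro F1: {macro_f1:.4f} | MAE: {mae:.4f}")
```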

5. Discussion

This comparative analysis provides valuable insights into the effectiveness of fine-tuning and ICL-based approaches (ZSL and FSL) for the SA task in Mexican Spanish when applied to tourism-related content.
Firstly, the experiments confirm that fine-tuning language models, particularly those explicitly trained for Spanish, consistently outperforms both ZSL and FSL approaches. Notably, the MarIA model, which was pre-trained on large-scale Spanish corpora, achieved the highest overall accuracy and the most balanced performance across all sentiment categories. This demonstrates the inherent advantage of linguistic alignment when sufficient labeled data is available. Nevertheless, the ICL-based approaches, particularly the ZSL configuration with the Mixtral 8x7B model and the FSL configuration with the Mistral 7B model, deliver competitive performance, almost matching the accuracy of the fine-tuning strategies. This underscores the considerable potential of modern LLMs in scenarios where annotated data is limited, costly, or difficult to obtain.
Interestingly, both fine-tuning and ICL-based approaches consistently struggle with intermediate sentiment classes (classes 2, 3 and 4). This issue arises due to the semantic ambiguity that is inherent in moderately polarized content, suggesting that capturing nuanced expressions of sentiment remains a significant challenge for current NLP techniques. Future research should explore more nuanced prompt engineering strategies or hybrid models that combine fine-tuning with in-context exemplification in order to address this limitation. Additionally, models across all strategies demonstrated greater proficiency in identifying highly positive sentiments (class 5) than highly negative sentiments (class 1), which could indicate a positive bias in the training data or an inherent linguistic complexity associated with negative expressions. This finding highlights the importance of creating balanced datasets and calls for further investigation into methods that specifically target negative sentiment detection.
In terms of the research questions defined in this study, our analysis provides the following insights:
Regarding RQ1, ICL-based approaches demonstrated highly competitive performance for SA tasks on Mexican Spanish tourism texts. In particular, in ZSL and FSL scenarios, models such as Mixtral 8x7B and Mistral 7B achieved performance metrics close to those obtained by the fine-tuned models. This confirms the effectiveness of ICL techniques as a viable alternative in situations where resources are limited.
Turning to RQ2, while ICL-based approaches yielded promising results, fine-tuning, particularly with Spanish-specific models such as MarIA, demonstrated superior performance. This suggests that, when sufficient labeled data is available, fine-tuning is the most effective strategy, particularly for accurately detecting moderately polarized sentiments.
In response to RQ3, Mixtral 8x7B demonstrated the most robust performance among the evaluated LLMs in ZSL scenarios, while Mistral 7B performed particularly well under FSL conditions. These findings suggest that model scale and the specific ICL-based approach significantly influence performance, favoring larger, more recent models.
Finally, in response to RQ4, significant differences were observed between the ICL-based approaches. The FSL approach generally improved performance over ZSL, particularly with medium-sized models such as Mistral 7B. This suggests that a small number of annotated in-context examples can considerably improve model performance and mitigate challenges related to semantic ambiguity in the intermediate sentiment classes.
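To illustrate the practical difference between the two ICL settings, the sketch below builds either a zero-shot prompt (no exemplars) or a few-shot prompt (a handful of labeled reviews). The instruction wording and the example reviews are assumptions for illustration, not the exact prompts used in our experiments, and `generate` stands in for whichever LLM inference call is available.

```python
from typing import Sequence, Tuple

INSTRUCTION = (
    "Clasifica la polaridad de la siguiente opinión turística escrita en español de México "
    "con un número del 1 (muy negativa) al 5 (muy positiva). Responde únicamente con el número."
)

def build_prompt(review: str, shots: Sequence[Tuple[str, int]] = ()) -> str:
    """Build a zero-shot prompt (no exemplars) or a few-shot prompt (k labeled exemplars)."""
    parts = [INSTRUCTION, ""]
    for example_text, label in shots:  # in-context exemplars, if any
        parts.append(f"Opinión: {example_text}\nPolaridad: {label}\n")
    parts.append(f"Opinión: {review}\nPolaridad:")
    return "\n".join(parts)

# Hypothetical exemplars; in practice they would be drawn from the training split
shots = [
    ("El hotel estaba sucio y el personal fue muy grosero.", 1),
    ("La comida era aceptable, nada especial.", 3),
    ("Las pirámides son impresionantes, ¡volveríamos sin dudarlo!", 5),
]
zero_shot_prompt = build_prompt("El tour fue caro, pero el guía sabía muchísimo.")
few_shot_prompt = build_prompt("El tour fue caro, pero el guía sabía muchísimo.", shots)
# prediction = generate(few_shot_prompt)  # parse the generated text into a 1-5 rating
```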
In terms of practical implications, the findings suggest that, for applications prioritizing maximum accuracy and possessing sufficient labeled datasets, fine-tuning Spanish-specific models such as MarIA remains optimal. Conversely, in settings with limited resources, or where rapid deployment without extensive annotation processes is necessary, ZSL and FSL configurations represent highly viable alternatives, offering significant efficiency gains with minimal performance trade-offs.
Ultimately, the choice between fine-tuning and ICL-based approaches should depend on the specific context, including the availability of annotated data and computational resources, and the necessity of capturing nuanced distinctions in sentiment. Further studies could explore hybrid or adaptive strategies that combine the strengths of both paradigms to enhance the overall performance of real-world SA tasks.
Overall, this study’s findings highlight the effectiveness of fine-tuning and ICL strategies for SA in Mexican Spanish tourism-related texts. They also demonstrate the potential of these strategies in designing next-generation intelligent and connected systems. These approaches can support the development of smart applications in areas such as tourism, public services and urban management, as they enable real-time opinion mining across large-scale, multilingual and noisy social data streams. Furthermore, they improve human–machine interaction and enhance the user experience while promoting the inclusion of underrepresented linguistic communities in digital platforms. Together, these aspects suggest that advances in SA using LLMs could lay the groundwork for more adaptive, trustworthy, and human-centred digital ecosystems.

6. Conclusions and Further Work

This study thoroughly evaluates SA strategies in Mexican Spanish by comparing fine-tuning with ICL-based approaches in ZSL and FSL settings. Experiments conducted on the REST-MEX 2022 and 2023 datasets revealed that fine-tuning encoder-based Spanish models such as MarIA consistently delivers the best overall performance. However, several instruction-tuned, decoder-based LLMs achieved competitive results in ZSL and FSL configurations, showing that these paradigms are viable alternatives when labeled data is scarce. Notably, the top-performing LLMs approximated fine-tuned performance with little to no supervision.
Looking ahead, future work could explore more intelligent selection methods for the examples used in FSL. For example, clustering, diversity heuristics or uncertainty-based sampling could be employed to increase effectiveness. It would also be valuable to analyze the impact of varying the number of in-context examples (number of shots) on model performance, as well as adapting the prompts to better align with the capabilities of each model family. Furthermore, as newer, more efficient LLMs become available, especially those optimized for multilingual or resource-constrained settings, it would be worthwhile to re-evaluate the trade-offs between performance and computational cost. Finally, efficiency-focused investigations, such as quantization, model distillation, or dynamic prompting, may help to bridge the gap between accuracy and practical deployment in real-world scenarios.
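One possible instantiation of the example-selection idea mentioned above is sketched here: candidate training reviews are embedded, clustered with k-means, and the review closest to each centroid is kept as a diverse few-shot exemplar. The embedding model name is an assumption; any multilingual sentence encoder, or any of the other heuristics listed above, could be substituted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def select_diverse_exemplars(texts, k=5, model_name="paraphrase-multilingual-MiniLM-L12-v2"):
    """Pick k reviews spread across the embedding space to use as few-shot exemplars."""
    encoder = SentenceTransformer(model_name)
    embeddings = encoder.encode(texts, normalize_embeddings=True)
    km = KMeans(n_clusters=k, n_init="auto", random_state=0).fit(embeddings)
    selected = []
    for centroid in km.cluster_centers_:
        distances = np.linalg.norm(embeddings - centroid, axis=1)
        selected.append(int(np.argmin(distances)))  # index of the review closest to the centroid
    return [texts[i] for i in sorted(set(selected))]
```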
Beyond comparing full fine-tuning and ICL strategies, it is important to acknowledge that our experiments did not include other adaptation and alignment strategies for LLMs. Promising directions that could further improve model performance in the SA task include PEFT methods such as LoRA or adapters, instruction tuning on instruction–response pairs, and alignment techniques like RLHF or CAI. However, applying these methods requires additional resources, specialized datasets or alignment infrastructures that are not yet readily available for Mexican Spanish. Exploring these approaches is an interesting avenue for future work, particularly in balancing performance with efficiency and ensuring fair and culturally sensitive SA in underrepresented languages.
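As a pointer for this line of future work, the snippet below sketches how a LoRA adapter could be attached to an encoder for five-class sentiment classification using the Hugging Face `peft` library, assuming the publicly released MarIA checkpoint (PlanTL-GOB-ES/roberta-base-bne); the hyperparameters and target modules are illustrative defaults, not values validated in this study.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "PlanTL-GOB-ES/roberta-base-bne",  # MarIA base checkpoint on the Hugging Face Hub
    num_labels=5,
)
lora_cfg = LoraConfig(
    task_type="SEQ_CLS",               # keeps the classification head trainable
    r=8,                               # low-rank update dimension (illustrative)
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections in the RoBERTa layers
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only a small fraction of the weights is updated
# The wrapped model can then be trained with the usual Trainer or a custom training loop.
```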
In addition to their technical contribution to NLP, the approaches explored in this study have the potential to enable intelligent, data-driven digital services. Future research could extend these methods to real-time pipelines integrated into smart applications, such as tourism platforms, urban information systems and public service monitoring systems. This would improve user experience, trust and inclusivity within large-scale socio-technical ecosystems.

Author Contributions

Conceptualization, R.V.-G. and M.A.P.-V.; methodology, M.d.P.S.-Z.; software, T.B.-B.; validation, T.B.-B., M.A.P.-V. and J.A.G.-D.; formal analysis, M.d.P.S.-Z.; investigation, T.B.-B.; resources, M.A.P.-V.; data curation, T.B.-B.; writing—original draft preparation, T.B.-B.; writing—review and editing, M.d.P.S.-Z.; visualization, J.A.G.-D.; supervision, R.V.-G.; project administration, R.V.-G.; funding acquisition, R.V.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Secretary of Science, Humanities, Technology, and Innovation (SECIHTI) under project CIORGANISMOS-2025-119, “Collaborative epidemiological surveillance and prevention of infectious diseases based on emerging models and methods of intelligent data analysis”. Tomás Bernal-Beltrán is supported by the University of Murcia through its predoctoral programme.

Data Availability Statement

Source code for training the fine-tuned models and for running the zero- and few-shot experiments is available at https://github.com/NLP-UMUTeam/future-internet-2025-sa-icl (accessed on 29 July 2025). No new data were created in this research; the REST-MEX 2022 and 2023 datasets must be requested from the original authors of the corresponding shared tasks.

Acknowledgments

Authors María del Pilar Salas-Zárate and Mario Andrés Paredes-Valverde gratefully acknowledge the support provided by the Secretary of Science, Humanities, Technology, and Innovation (SECIHTI) and the Tecnológico Nacional de México (TecNM). Tomás Bernal-Beltrán is supported by the University of Murcia through its predoctoral programme.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gangadharbatla, H.; Bright, L.F.; Logan, K. Social Media and news gathering: Tapping into the millennial mindset. J. Soc. Media Soc. 2014, 3, 1–2. [Google Scholar]
  2. Vosoughi, S.; Roy, D.; Aral, S. The spread of true and false news online. Science 2018, 359, 1146–1151. [Google Scholar] [CrossRef]
  3. Metzler, H.; Garcia, D. Social drivers and algorithmic mechanisms on digital media. Perspect. Psychol. Sci. 2024, 19, 735–748. [Google Scholar] [CrossRef] [PubMed]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 8–9. [Google Scholar]
  5. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 3–5 June 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  6. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised cross-lingual representation learning at scale. arXiv 2019, arXiv:1911.02116. [Google Scholar]
  7. Feng, F.; Yang, Y.; Cer, D.; Arivazhagan, N.; Wang, W. Language-agnostic BERT sentence embedding. arXiv 2020, arXiv:2007.01852. [Google Scholar]
  8. Álvarez-Carmona, M.Á.; Díaz-Pacheco, Á.; Aranda, R.; Rodríguez-González, A.Y.; Fajardo-Delgado, D.; Guerrero-Rodríguez, R.; Bustio-Martínez, L. Overview of rest-mex at iberlef 2022: Recommendation system, sentiment analysis and covid semaphore prediction for mexican tourist texts. Proces. Leng. Nat. 2022, 69, 289–299. [Google Scholar]
  9. Álvarez-Carmona, M.Á.; Díaz-Pacheco, Á.; Aranda, R.; Rodríguez-González, A.Y.; Muñiz-Sánchez, V.; López-Monroy, A.P.; Sánchez-Vega, F.; Bustio-Martínez, L. Overview of rest-mex at iberlef 2023: Research on sentiment analysis task for mexican tourist texts. Proces. Leng. Nat. 2023, 71, 425–436. [Google Scholar]
  10. Zhu, X.; Gardiner, S.; Roldán, T.; Rossouw, D. The Model Arena for Cross-lingual Sentiment Analysis: A Comparative Study in the Era of Large Language Models. In Proceedings of the 14th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis, Bangkok, Thailand, 15 August 2024; pp. 141–152. [Google Scholar]
  11. Yan, S.; Wang, J.; Song, Z. Microblog Sentiment Analysis Based on Dynamic Character-Level and Word-Level Features and Multi-Head Self-Attention Pooling. Future Internet 2022, 14, 234. [Google Scholar] [CrossRef]
  12. Karabila, I.; Darraz, N.; El-Ansari, A.; Alami, N.; El Mallahi, M. Enhancing Collaborative Filtering-Based Recommender System Using Sentiment Analysis. Future Internet 2023, 15, 235. [Google Scholar] [CrossRef]
  13. Alaei, A.; Wang, Y.; Bui, V.; Stantic, B. Target-Oriented Data Annotation for Emotion and Sentiment Analysis in Tourism Related Social Media Data. Future Internet 2023, 15, 150. [Google Scholar] [CrossRef]
  14. Zhang, L.; Liu, B. Sentiment analysis and opinion mining. In Encyclopedia of Machine Learning and Data Mining; Springer: Berlin/Heidelberg, Germany, 2017; pp. 1152–1161. [Google Scholar]
  15. Pang, B.; Lee, L. Opinion mining and sentiment analysis. Found. Trends® Inf. Retr. 2008, 2, 1–135. [Google Scholar] [CrossRef]
  16. Kim, S.M.; Hovy, E. Determining the sentiment of opinions. In Proceedings of the Coling 2004: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, 23–27 August 2004; pp. 1367–1373. [Google Scholar]
  17. Hu, M.; Liu, B. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 22–25 August 2004; pp. 168–177. [Google Scholar]
  18. Tiffani, I.E. Optimization of naïve bayes classifier by implemented unigram, bigram, trigram for sentiment analysis of hotel review. J. Soft Comput. Explor. 2020, 1, 1–7. [Google Scholar] [CrossRef]
  19. Pak, A.; Paroubek, P. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the LREc, Valletta, Malta, 17–23 May 2010; Volume 10, pp. 1320–1326. [Google Scholar]
  20. Gamon, M. Sentiment classification on customer feedback data: Noisy data, large feature vectors, and the role of linguistic analysis. In Proceedings of the COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland, 23–27 August 2004; pp. 841–847. [Google Scholar]
  21. Zhang, L.; Wang, S.; Liu, B. Deep learning for sentiment analysis: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2018, 8, e1253. [Google Scholar] [CrossRef]
  22. Moraes, R.; Valiati, J.F.; Neto, W.P.G. Document-level sentiment classification: An empirical comparison between SVM and ANN. Expert Syst. Appl. 2013, 40, 621–633. [Google Scholar] [CrossRef]
  23. Tang, D.; Qin, B.; Liu, T. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 1422–1432. [Google Scholar]
  24. Rodríguez-Sánchez, F.; Carrillo-de Albornoz, J.; Plaza, L.; Gonzalo, J.; Rosso, P.; Comet, M.; Donoso, T. Overview of exist 2021: Sexism identification in social networks. Proces. Leng. Nat. 2021, 67, 195–207. [Google Scholar]
  25. Rodríguez-Sánchez, F.; Carrillo-de Albornoz, J.; Plaza, L.; Mendieta-Aragón, A.; Marco-Remón, G.; Makeienko, M.; Plaza, M.; Gonzalo, J.; Spina, D.; Rosso, P. Overview of exist 2022: Sexism identification in social networks. Proces. Leng. Nat. 2022, 69, 229–240. [Google Scholar]
  26. Cañete, J.; Chaperon, G.; Fuentes, R.; Ho, J.H.; Kang, H.; Pérez, J. Spanish Pre-Trained BERT Model and Evaluation Data. In Proceedings of the PML4DC at ICLR 2020, Addis Ababa, Ethiopia, 26 April 2020; pp. 3–4. [Google Scholar]
  27. Fandiño, A.G.; Estapé, J.A.; Pàmies, M.; Palao, J.L.; Ocampo, J.S.; Carrino, C.P.; Oller, C.A.; Penagos, C.R.; Agirre, A.G.; Villegas, M. MarIA: Spanish Language Models. Proces. Leng. Nat. 2022, 68, 3–8. [Google Scholar] [CrossRef]
  28. Xian, Y.; Lampert, C.H.; Schiele, B.; Akata, Z. Zero-shot Learning-A Comprehensive Evaluation of the Good, the Bad and the Ugly. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2251–2265. [Google Scholar] [CrossRef]
  29. Perez, E.; Kiela, D.; Cho, K. True few-shot learning with language models. Adv. Neural Inf. Process. Syst. 2021, 34, 11054–11070. [Google Scholar]
  30. Mu, J.; Wang, W.; Liu, W.; Yan, T.; Wang, G. Multimodal large language model with lora fine-tuning for multimodal sentiment analysis. ACM Trans. Intell. Syst. Technol. 2024, 16, 7–11. [Google Scholar] [CrossRef]
  31. Zhang, B.; Yang, H.; Liu, X.Y. Instruct-FinGPT: Financial Sentiment Analysis by Instruction Tuning of General-Purpose Large Language Models. arXiv 2023, arXiv:2306.12659. [Google Scholar] [CrossRef]
  32. Sumers, T.R.; Ho, M.K.; Hawkins, R.D.; Narasimhan, K.; Griffiths, T.L. Learning rewards from linguistic feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 6002–6010. [Google Scholar]
  33. Bai, Y.; Kadavath, S.; Kundu, S.; Askell, A.; Kernion, J.; Jones, A.; Chen, A.; Goldie, A.; Mirhoseini, A.; McKinnon, C.; et al. Constitutional ai: Harmlessness from ai feedback. arXiv 2022, arXiv:2212.08073. [Google Scholar] [CrossRef]
  34. de la Rosa, J.; Ponferrada, E.G.; Villegas, P.; de Prado Salas, P.G.; Romero, M.; Grandury, M. BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling. Proces. Leng. Nat. 2022, 68, 13–23. [Google Scholar]
  35. Cañete, J.; Donoso, S.; Bravo-Marquez, F.; Carvallo, A.; Araujo, V. ALBETO and DistilBETO: Lightweight Spanish Language Models. In Proceedings of the 13th Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022. [Google Scholar]
Figure 1. Confusion matrices on the REST-MEX 2022 evaluation dataset for the fine-tuning with MarIA (a), ZSL with Mixtral 8x7B (b) and FSL with Mistral 7B (c) strategies. Darker colors indicate a higher number of predictions in each actual–predicted pair, while lighter colors indicate fewer predictions.
Figure 2. Confusion matrices on the REST-MEX 2023 evaluation dataset for the fine-tuning with MarIA (a), ZSL with Mixtral 8x7B (b) and FSL with Mistral 7B (c) strategies. Darker colors indicate a higher number of predictions in each actual–predicted pair, while lighter colors indicate fewer predictions.
Table 2. Overall performance metrics for all models evaluated on the REST-MEX 2022 dataset. Metrics include accuracy (A), Mean Absolute Error (MAE), precision (P), recall (R) and macro-averaged F1-score (F1). Percentages (A, P, R, and F1) are expressed in the format XX.XX, while non-percentage values (MAE) are expressed in the format X.XXXX, reflecting their different numerical scales.
Model | A | MAE | P | R | F1
Fine-tuning
MarIA | 76.89 | 0.2521 | 60.11 | 55.63 | 57.47
BETO Cased | 75.79 | 0.2644 | 57.72 | 52.44 | 54.41
BETO Uncased | 76.15 | 0.2621 | 57.34 | 51.53 | 53.56
BERTIN | 75.92 | 0.2622 | 59.72 | 51.68 | 53.55
ALBETO | 76.33 | 0.2609 | 59.47 | 53.63 | 55.50
DistilBETO | 75.80 | 0.2679 | 58.02 | 52.58 | 54.36
ZSL
Gemma 2 2B | 50.75 | 0.6475 | 32.94 | 26.60 | 25.93
Gemma 2 9B | 68.31 | 0.4123 | 45.23 | 39.69 | 40.50
Gemma 3 1B | 42.19 | 0.8044 | 23.08 | 28.88 | 19.02
LLaMA 2 7B | 62.66 | 0.4123 | 42.47 | 39.31 | 37.87
LLaMA 3.1 8B | 65.64 | 0.4370 | 47.77 | 52.20 | 45.14
LLaMA 3.2 3B | 35.64 | 0.8770 | 31.36 | 35.62 | 25.55
Qwen 2.5 7B | 67.49 | 0.3451 | 50.81 | 57.44 | 52.78
Qwen 3 8B | 69.08 | 0.4059 | 39.10 | 52.08 | 40.44
Mistral 7B | 75.12 | 0.2924 | 31.08 | 34.08 | 30.20
Mixtral 8x7B | 76.03 | 0.2713 | 56.14 | 56.31 | 55.41
FSL
Gemma 2 2B | 72.00 | 0.3840 | 39.27 | 34.30 | 32.67
Gemma 2 9B | 70.44 | 0.4302 | 48.46 | 28.61 | 29.47
Gemma 3 1B | 69.91 | 0.4477 | 52.44 | 21.61 | 19.63
LLaMA 2 7B | 73.01 | 0.3534 | 48.81 | 40.50 | 38.10
LLaMA 3.1 8B | 61.29 | 0.5126 | 39.45 | 45.33 | 39.28
LLaMA 3.2 3B | 66.82 | 0.5129 | 29.99 | 28.43 | 25.30
Qwen 2.5 7B | 61.45 | 0.4650 | 42.35 | 56.33 | 45.28
Qwen 3 8B | 70.94 | 0.3767 | 41.69 | 53.59 | 42.01
Mistral 7B | 75.87 | 0.2742 | 53.17 | 51.43 | 50.27
Mixtral 8x7B | 73.65 | 0.3279 | 50.79 | 47.35 | 48.16
Table 3. Class-wise performance metrics for all models evaluated on the REST-MEX 2022 dataset. Metrics include F1-score for class 1 (FS 1), F1-score for class 2 (FS 2), F1-score for class 3 (FS 3), F1-score for class 4 (FS 4) and F1-score for class 5 (FS 5). All reported values are percentages expressed in the format XX.XX.
Model | FS 1 | FS 2 | FS 3 | FS 4 | FS 5
Fine-tuning
MarIA | 57.45 | 40.61 | 50.58 | 50.48 | 88.22
BETO Cased | 51.16 | 35.47 | 47.86 | 50.02 | 87.57
BETO Uncased | 44.97 | 34.90 | 50.13 | 50.02 | 87.78
BERTIN | 43.53 | 34.38 | 50.77 | 51.46 | 87.60
ALBETO | 50.00 | 40.56 | 50.19 | 48.88 | 87.86
DistilBETO | 46.34 | 37.81 | 50.39 | 49.83 | 87.41
ZSL
Gemma 2 2B | 14.28 | 01.50 | 25.36 | 21.82 | 66.69
Gemma 2 9B | 46.66 | 18.47 | 26.35 | 29.71 | 81.29
Gemma 3 1B | 19.23 | 18.36 | 06.35 | 34.27 | 54.95
LLaMA 2 7B | 51.02 | 09.72 | 48.23 | 42.64 | 75.64
LLaMA 3.1 8B | 42.85 | 32.61 | 34.27 | 33.72 | 82.26
LLaMA 3.2 3B | 16.32 | 15.98 | 21.12 | 24.86 | 49.49
Qwen 2.5 7B | 57.27 | 31.45 | 47.68 | 47.11 | 80.40
Qwen 3 8B | 52.45 | 16.25 | 18.78 | 27.45 | 87.27
Mistral 7B | 52.54 | 15.66 | 47.84 | 37.44 | 88.14
Mixtral 8x7B | 56.43 | 39.38 | 50.00 | 43.36 | 87.91
FSL
Gemma 2 2B | 38.78 | 05.97 | 38.11 | 27.63 | 85.53
Gemma 2 9B | 31.95 | 08.27 | 15.75 | 08.51 | 82.89
Gemma 3 1B | 03.73 | 00.00 | 09.45 | 02.35 | 82.62
LLaMA 2 7B | 48.59 | 04.47 | 34.67 | 17.00 | 85.78
LLaMA 3.1 8B | 38.94 | 21.66 | 19.01 | 36.40 | 80.40
LLaMA 3.2 3B | 20.06 | 07.14 | 04.44 | 12.84 | 82.05
Qwen 2.5 7B | 51.72 | 22.92 | 33.78 | 40.59 | 77.39
Qwen 3 8B | 51.06 | 17.12 | 31.27 | 22.27 | 88.32
Mistral 7B | 58.23 | 25.00 | 47.31 | 32.65 | 88.18
Mixtral 8x7B | 49.32 | 26.36 | 44.81 | 34.26 | 86.03
Table 4. Overall performance metrics for all models evaluated on the REST-MEX 2023 dataset. Metrics include accuracy (A), Mean Absolute Error (MAE), precision (P), recall (R) and macro-averaged F1-score (F1). All reported values are presented with the same number of decimal places for consistency. Percentages (A, P, R, and F1) are expressed in the format XX.XX, while non-percentage values (MAE) are expressed in the format X.XXXX, reflecting their different numerical scales.
Model | A | MAE | P | R | F1
Fine-tuning
MarIA | 71.52 | 0.3048 | 62.03 | 57.82 | 59.37
BETO Cased | 71.48 | 0.3060 | 60.11 | 56.91 | 58.34
BETO Uncased | 71.49 | 0.3070 | 61.08 | 57.41 | 59.03
BERTIN | 67.50 | 0.3516 | 56.98 | 53.86 | 55.11
ALBETO | 71.28 | 0.3074 | 59.97 | 57.07 | 58.33
DistilBETO | 71.13 | 0.3109 | 60.29 | 56.56 | 58.26
ZSL
Gemma 2 2B | 43.04 | 0.7757 | 35.19 | 27.59 | 26.80
Gemma 2 9B | 61.01 | 0.5361 | 34.89 | 27.23 | 28.70
Gemma 3 1B | 34.24 | 0.9499 | 18.15 | 22.13 | 12.80
LLaMA 2 7B | 55.13 | 0.4968 | 48.93 | 45.19 | 42.42
LLaMA 3.1 8B | 54.91 | 0.5805 | 45.58 | 49.04 | 42.25
LLaMA 3.2 3B | 35.97 | 0.9079 | 35.03 | 37.24 | 29.51
Qwen 2.5 7B | 60.15 | 0.4226 | 43.67 | 48.52 | 44.52
Qwen 3 8B | 62.23 | 0.4911 | 39.04 | 50.32 | 40.20
Mistral 7B | 69.31 | 0.3572 | 24.75 | 26.93 | 23.74
Mixtral 8x7B | 70.83 | 0.3229 | 55.86 | 57.13 | 55.96
FSL
Gemma 2 2B | 72.00 | 0.3840 | 39.27 | 34.30 | 32.67
Gemma 2 9B | 70.44 | 0.4302 | 48.46 | 28.61 | 29.47
Gemma 3 1B | 69.91 | 0.4477 | 52.44 | 21.61 | 19.63
LLaMA 2 7B | 73.01 | 0.3534 | 48.81 | 40.50 | 38.10
LLaMA 3.1 8B | 61.29 | 0.5126 | 39.45 | 45.33 | 39.28
LLaMA 3.2 3B | 66.82 | 0.5129 | 29.99 | 28.43 | 25.30
Qwen 2.5 7B | 61.45 | 0.4650 | 42.35 | 56.33 | 45.28
Qwen 3 8B | 70.94 | 0.3767 | 41.69 | 53.59 | 42.01
Mistral 7B | 75.87 | 0.2742 | 53.17 | 51.43 | 50.27
Mixtral 8x7B | 65.87 | 0.4108 | 43.77 | 42.55 | 42.99
Table 5. Class-wise performance metrics for all models evaluated on the REST-MEX 2023 dataset. Metrics include F1-score for class 1 (FS 1), F1-score for class 2 (FS 2), F1-score for class 3 (FS 3), F1-score for class 4 (FS 4) and F1-score for class 5 (FS 5). All reported values are percentages expressed in the format XX.XX.
Model | FS 1 | FS 2 | FS 3 | FS 4 | FS 5
Fine-tuning
MarIA | 60.70 | 46.22 | 54.98 | 51.66 | 83.29
BETO Cased | 61.13 | 42.70 | 54.55 | 49.72 | 83.58
BETO Uncased | 64.20 | 44.43 | 53.46 | 49.51 | 83.54
BERTIN | 60.44 | 39.61 | 47.84 | 47.34 | 80.32
ALBETO | 62.27 | 41.90 | 54.16 | 49.85 | 83.47
DistilBETO | 62.74 | 41.98 | 54.12 | 49.14 | 83.31
ZSL
Gemma 2 2B | 23.74 | 01.60 | 26.78 | 24.38 | 57.51
Gemma 2 9B | 34.24 | 11.95 | 22.47 | 28.33 | 75.24
Gemma 3 1B | 20.75 | 14.65 | 02.67 | 39.12 | 37.99
LLaMA 2 7B | 45.31 | 07.50 | 48.15 | 45.47 | 65.64
LLaMA 3.1 8B | 40.04 | 32.08 | 31.73 | 34.76 | 72.61
LLaMA 3.2 3B | 30.56 | 16.14 | 22.41 | 28.58 | 49.86
Qwen 2.5 7B | 61.55 | 38.40 | 47.26 | 48.61 | 71.28
Qwen 3 8B | 53.45 | 15.60 | 19.94 | 29.22 | 82.80
Mistral 7B | 52.44 | 15.27 | 48.06 | 37.69 | 83.99
Mixtral 8x7B | 60.35 | 40.42 | 49.56 | 45.46 | 84.01
FSL
Gemma 2 2B | 38.78 | 05.97 | 38.11 | 27.63 | 85.53
Gemma 2 9B | 31.95 | 08.27 | 15.75 | 08.51 | 82.89
Gemma 3 1B | 03.73 | 00.00 | 09.45 | 02.35 | 82.62
LLaMA 2 7B | 48.59 | 04.47 | 34.67 | 17.00 | 85.78
LLaMA 3.1 8B | 38.94 | 21.66 | 19.01 | 36.40 | 80.40
LLaMA 3.2 3B | 20.06 | 07.14 | 04.44 | 12.84 | 82.05
Qwen 2.5 7B | 51.72 | 22.92 | 33.78 | 40.59 | 77.39
Qwen 3 8B | 51.06 | 17.12 | 31.27 | 22.27 | 88.32
Mistral 7B | 58.23 | 25.00 | 47.31 | 32.65 | 88.18
Mixtral 8x7B | 57.18 | 31.16 | 46.43 | 44.89 | 78.25
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
