Article

Cross-Lingual Sentiment Classification in Sustainable Mobility: A Zero-Shot Domain Transfer Evaluation Framework

1 Department of Computer Science and Artificial Intelligence, University of the Basque Country UPV/EHU, 20018 Donostia-San Sebastián, Spain
2 Basque Research and Technology Alliance, Pol. Kurutz Gain 10, 20850 Mendaro, Spain
3 Department of Civil Engineering, University of Granada, Campus de Fuentenueva, s/n, 18071 Granada, Spain
* Author to whom correspondence should be addressed.
AI 2026, 7(6), 216; https://doi.org/10.3390/ai7060216
Submission received: 10 April 2026 / Revised: 15 May 2026 / Accepted: 4 June 2026 / Published: 12 June 2026
(This article belongs to the Section AI Systems: Theory and Applications)

Abstract

This study evaluates zero-shot domain transfer for multilingual sentiment analysis in sustainable urban mobility using XLM-RoBERTa, a multilingual transformer fine-tuned on social media sentiment data and applied to transport reviews without further task- or domain-specific fine-tuning. Starting from a manually annotated English corpus of 375 transport-related user reviews, we created sentence-aligned translations in Spanish, French, German, and Italian, yielding a multilingual evaluation dataset of 1875 instances. Results show that the model assigns consistently high confidence to polarized content (mean: 0.76–0.85) and lower confidence to neutral or ambiguous expressions (0.58–0.65), with visible but preliminary cross-lingual variations that require further linguistic validation. Confidence scores are treated as diagnostic indicators of model certainty, not as evidence of correctness or calibration. A qualitative analysis of 113 categorized low-confidence predictions identifies six recurring linguistic patterns associated with model uncertainty (led by translation drift, mixed sentiment, and idiomatic expressions) with substantial inter-annotator agreement (κ = 0.664). By releasing the annotated multilingual dataset and code publicly, this work provides a reproducible exploratory evaluation framework for annotation-scarce, domain-specific multilingual NLP.

1. Introduction

As cities worldwide face challenges from climate change and urban congestion, sustainable urban mobility has emerged as a central policy priority for improving environmental quality, public health, and social inclusion [1]. International frameworks reflect this urgency, including the European Commission’s Sustainable and Smart Mobility Strategy 2020 [2], the United Nations Sustainable Development Goals (Goal 11 and Goal 13) [3], and regional strategies such as the Basque Government’s Estrategia de Movilidad Sostenible de Euskadi 2030 (Basque Sustainable Mobility Strategy 2030) [4]. Translating these commitments into measurable outcomes, however, requires effective monitoring of how citizens experience and perceive mobility services, a gap that traditional evaluation approaches have struggled to fill.
Traditional mobility studies rely on surveys, statistical reports, and simulation-based models [5], which often face limitations in temporal granularity, geographical coverage, and the ability to capture subjective user experiences in near real-time [6]. The widespread adoption of digital platforms has generated an unprecedented volume of user-generated content (UGC), such as online reviews, social media posts, and geolocated comments, which offers a rich, real-time window into public perceptions of transport services [7]. Yet the vast majority of UGC in urban environments is multilingual, particularly in cities such as Brussels, Barcelona, or Montreal, and existing NLP pipelines predominantly operate in English, introducing significant representational bias when applied to linguistically diverse populations [8].
Sentiment analysis refers to the computational identification and classification of subjective opinions, attitudes, and emotions expressed in text [9]. In this work, sentiment polarity is operationalized as a three-class label (positive, neutral, or negative) reflecting the overall evaluative stance of a user toward a transport service or experience. Sentiment analysis has been increasingly applied to transportation research over the past decade, evolving from lexicon-based methods [9,10], which offered interpretability but struggled with contextual awareness and informal, domain-specific language, to transformer architectures [11] and large pretrained language models [12], which substantially improved both performance and cross-domain generalization.
In the transport domain, sentiment-oriented and satisfaction-related analyses have been applied across a range of contexts: from social media-based measurement of transit rider sentiment [13] to survey-based modelling of user satisfaction in public transport [14], to domain-specific review classification for sustainable mobility policy [15]. Despite this progress, most existing work operates on monolingual English corpora, leaving multilingual urban contexts largely underserved. Addressing this gap requires moving beyond English-centric pipelines and leveraging the cross-lingual capabilities of modern transformer models. The broader NLP community has developed powerful tools in this direction: multilingual transformer models such as XLM-RoBERTa [16], mBERT, and XLM [17] have achieved state-of-the-art results in zero-shot and few-shot sentiment tasks [18,19], supported by evaluation benchmarks such as XTREME [20] and MASSIVE [21], and by large-scale translation initiatives including M2M-100 [22] and No Language Left Behind [23]. Yet the application of these models to domain-specific contexts such as sustainable mobility remains largely unexplored, and this paper directly addresses that gap.
While prior work has applied multilingual transformers to sentiment tasks, three gaps remain unaddressed in the literature. First, unlike prior transport sentiment studies [13,14,15], which operate exclusively on English corpora, no existing study provides a sentence-aligned multilingual evaluation dataset specifically designed for sustainable transport discourse across multiple European languages. Second, while zero-shot and few-shot transfer have been evaluated on general benchmark datasets [18,19,20], confidence-based diagnostics have not been systematically explored as a structured way to inspect model behavior in domain-specific zero-shot NLP when gold-standard annotations are unavailable. Third, qualitative taxonomies of low-confidence linguistic patterns (useful for guiding pre-processing and future annotation strategies in scarce-resource deployments) remain absent from the transport NLP literature. This work directly addresses all three gaps, building on the English UGSC established in [15] and extending it into a multilingual, cross-lingual evaluation framework.
These gaps point to two open empirical questions that this study directly addresses. The first concerns cross-domain transfer: whether models fine-tuned on general-purpose, informal social media data (e.g., X, formerly Twitter) can produce stable and interpretable confidence patterns when applied to longer, more descriptive, domain-specific transport reviews without task-specific adaptation. The second concerns cross-lingual consistency: whether confidence and polarity distributions remain stable across languages when sentence-aligned translations are used, or whether language-specific structures and translation effects introduce systematic differences. These questions motivate two research hypotheses: H1 predicts that XLM-RoBERTa will classify sentiment with consistent confidence across languages, particularly for clearly polarized content; H2 predicts systematic cross-lingual variations, especially for neutral or mixed statements.
To address these questions, this paper introduces a reproducible zero-shot domain transfer multilingual sentiment evaluation framework built around three contributions: (1) a sentence-aligned multilingual dataset (UGSC-ML) covering English, Spanish, French, German, and Italian, derived from real-world mobility reviews and publicly released on Zenodo [24] and GitHub [25]; (2) a zero-shot domain transfer evaluation pipeline using XLM-RoBERTa [16] applied without task- or domain-specific fine-tuning, applicable to contexts where labeled data are scarce; and (3) a qualitative taxonomy of six linguistic patterns (mixed sentiment, conditional structures, idiomatic expressions, irony, punctuation noise, and translation drift) that are associated with lower model confidence, providing actionable guidance for future pre-processing and fine-tuning strategies.
Our principal findings show that XLM-RoBERTa assigns consistently high confidence to polarized content across all five languages (mean confidence 0.76–0.85 for the positive and negative classes), indicating stable cross-lingual confidence patterns. This conclusion is tentative in the absence of gold-standard annotations for the translated data and should be tested directly through future supervised validation. The model also assigns lower confidence to neutral or ambiguous inputs (0.58–0.65), which we interpret as a confidence-based indicator of uncertainty rather than as evidence of correctness or calibration. Visible but preliminary cross-linguistic variations are observed, consistent with H2, including higher negative proportions in Italian and cross-lingual differences in neutral rates; these patterns are treated as preliminary findings requiring further linguistic validation. Among the 1875 predictions, 113 low-confidence predictions (6.0%) were categorized into the six recurrent linguistic categories reported in the qualitative taxonomy. Inter-annotator agreement over these categorized cases reached κ = 0.664 (substantial agreement). This taxonomy identifies specific linguistic structures associated with lower model confidence, offering concrete directions for improving zero-shot domain transfer NLP in annotation-scarce, domain-specific deployment scenarios.
The remainder of this paper is organized as follows. Section 2 describes the materials and methods, including the construction of the sentence-aligned multilingual dataset (UGSC-ML) and the zero-shot domain transfer evaluation pipeline based on XLM-RoBERTa. Section 3 presents the results, covering overall sentiment distributions and model confidence, language-specific patterns, and a qualitative taxonomy of low-confidence cases. Section 4 then discusses the methodological contributions, the generalizability of the framework, and its limitations. Finally, Section 5 offers concluding remarks and directions for future work.

2. Materials and Methods

2.1. Dataset and Multilingual Translation Pipeline

We base our analysis on the User Gold Standard Corpus (UGSC), a sentence-level dataset of user-generated reviews related to sustainable transport, previously introduced and validated in [15]. The UGSC was built from a large-scale TripAdvisor corpus of 117,458 English-language sentences collected from 2007 to 2020, covering diverse transport modes. A subset of 2000 sentences was manually annotated for sentiment polarity following a rigorous quality-control process: sentences with ambiguous or incorrect TripAdvisor star ratings were manually corrected or discarded, and inter-annotator agreement was evaluated using Cohen’s Kappa (κ = 0.487, moderate agreement), confirming the need for and value of the manual annotation effort. The resulting corpus was used to fine-tune XLM-RoBERTa in [15], achieving F1 above 97% on clean data.
For the present study, we selected 375 sentences from the UGSC following a stratified sampling procedure designed to preserve diversity across source polarity, transport mode, sentence length, and linguistic complexity. The sampling was not intended to produce a statistically representative benchmark of the full TripAdvisor corpus, but rather a controlled, diverse evaluation subset suitable for exploratory zero-shot domain transfer analysis. Because all selected sentences were drawn from the same TripAdvisor source corpus, stratification by review source was not applicable. Since the original UGSC was constructed as a binary gold-standard corpus from 1-star (negative) and 5-star (positive) TripAdvisor reviews, the 375 selected sentences carry binary source polarity labels. The three-class distinction (positive, neutral, negative) in this study arises from the CardiffNLP model output: neutral predictions are therefore model outputs rather than original gold-standard source labels, a distinction discussed further in Section 3.4.
Sentences with unclear wording, corrupted text, or ambiguous polarity at the boundary of adjacent ratings were excluded to avoid introducing annotation noise into the evaluation set. Importantly, the 375 English sentences retain their gold-standard sentiment labels from the original UGSC [15], which was constructed and validated in prior work by the same research group. This means that the English subset of UGSC-ML carries a manually validated binary reference point inherited from the UGSC, providing context for assessing zero-shot model behavior on the source language. However, a direct supervised comparison is not straightforward because the present model produces three-class predictions, including a neutral class absent from the original binary annotation scheme, as discussed in Section 3.4. The absence of gold-standard annotations, therefore, applies specifically to the four translated subsets, an inherent and expected constraint of the cross-lingual zero-shot design.
TripAdvisor was selected as the source platform because it provides structured star-rating metadata alongside free-text reviews, enabling a form of distant supervision: 1-star reviews provide implied negative labels and 5-star reviews provide positive labels, reducing annotation effort while maintaining interpretable quality control. Cohen’s Kappa was chosen as the inter-annotator agreement metric for the original UGSC annotation because it corrects for chance agreement, providing a more conservative and methodologically appropriate measure than raw percentage agreement for categorical annotation tasks. While the original UGSC uses binary sentiment labels, the evaluation in this study extends to three classes (positive, neutral, and negative) through the CardiffNLP model output, rather than through new manual three-class annotation. The sentences were translated into Spanish, French, German, and Italian using a sentence-aligned pipeline: automatic translation tools provided initial drafts, followed by thorough manual review conducted by two independent annotators with expertise in both sentiment analysis and multilingual data, to ensure preservation of sentiment polarity, tone, and structural coherence. Discrepancies were corrected manually, and an independent double-check of 10% of translated sentences yielded a percentage agreement of 91% on sentiment polarity preservation (note: this figure reports raw percentage agreement on the translation review task, not Cohen’s Kappa, as the check focused on binary preservation of the source-language label rather than independent annotation); conflicts were resolved by consensus.
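To make the difference between the two agreement measures mentioned above concrete, the following minimal sketch computes raw percentage agreement and Cohen's Kappa for the same pair of label sequences. The function names and the toy labels are hypothetical and are not taken from the UGSC or UGSC-ML annotation files; the sketch only illustrates why the chance-corrected value is lower than raw agreement when one label dominates.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of items on which two annotators assign the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    p_o = percent_agreement(a, b)  # observed agreement
    n = len(a)
    freq_a, freq_b = Counter(a), Counter(b)
    # Expected chance agreement from each annotator's label marginals.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Degenerate case (both used one identical label); kappa is
        # undefined, and returning 1.0 is a convention assumed here.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy labels, NOT real annotation data.
annotator_1 = ["pos", "pos", "pos", "pos", "neg", "neg", "pos", "pos", "pos", "neg"]
annotator_2 = ["pos", "pos", "pos", "pos", "neg", "pos", "pos", "pos", "pos", "neg"]

raw = percent_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"raw agreement = {raw:.2f}, kappa = {kappa:.3f}")
```

In this toy case the annotators disagree on one item out of ten, so raw agreement is 0.90, while kappa is lower because part of that agreement is expected by chance given the skew toward the positive label.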
The final multilingual evaluation set (UGSC-ML) comprises 1875 sentence instances across five languages and is openly available in three repositories: the original English UGSC on GitHub [26], the multilingual version with model predictions on Zenodo, version 1.1.0 [24], and associated inference scripts on a dedicated GitHub repository, release v1.0.0 [25]. This open release provides the community with a reusable resource for domain-specific multilingual sentiment evaluation.

2.2. Zero-Shot Sentiment Classification Model

We apply a zero-shot domain transfer classification strategy using the CardiffNLP twitter-xlm-roberta-base-sentiment model [27], a multilingual transformer pre-trained on the CC100 corpus (covering 100+ languages) and fine-tuned on sentiment-labeled Twitter data, which assigns each input one of three polarity classes (positive, neutral, or negative) together with a confidence score for each prediction. Importantly, the term “zero-shot” in this context refers to zero-shot domain transfer: the model is not zero-shot in the general sense but rather applied without any task-specific or domain-specific fine-tuning to transport reviews, making it a realistic evaluation of cross-domain generalization. The model was selected because the objective of this paper is not to benchmark multilingual architectures, but to evaluate a reproducible zero-shot domain transfer framework under annotation-scarce conditions using a public, sentiment-specialized multilingual classifier. Compared with alternative multilingual models such as mBERT [12] and XLM [17], the CardiffNLP model offers three practical advantages for this study: (1) it provides direct three-class sentiment predictions (positive, neutral, negative) without requiring an additional classification head; (2) it covers all five target languages in its pre-training corpus; and (3) its public availability on Hugging Face ensures full reproducibility. Multilingual E5, while relevant as a multilingual embedding model, would require an additional supervised or weakly supervised classifier to make comparable sentiment predictions. A full architecture-level comparison with mBERT, XLM, multilingual E5, and other recent multilingual encoders is therefore treated as a separate experimental question and included as a priority for future work.
We acknowledge an inherent domain mismatch: the model’s fine-tuning data consists of short, informal social media posts, whereas our target domain comprises longer, more descriptive transport reviews. This mismatch is intentionally accepted to evaluate cross-domain as well as cross-lingual generalization in a realistic deployment scenario where domain-specific labeled data are scarce.

2.3. Evaluation Protocol

Given the absence of gold-standard annotations for the translated versions, we adopt a three-component evaluation strategy in which each approach addresses a distinct dimension of model behavior:
  • Distributional coherence analysis: We assess whether predicted sentiment class proportions remain consistent across languages, as significant deviations from the source-language distribution may indicate systematic translation drift or language-specific model biases.
  • Confidence-based evaluation: We analyze average confidence scores per sentiment class and per language as a proxy for model certainty. This metric is particularly valuable in the absence of gold-standard annotations, as it provides an interpretable diagnostic signal of model behavior without requiring labeled target-language data. Low confidence scores, in turn, are interpreted as indicators of potential model uncertainty, typically arising in ambiguous or linguistically complex cases such as those involving irony, idiomatic expressions, or mixed-sentiment structures. Specifically, high confidence in polarized classes and lower confidence in neutral or ambiguous cases are expected and interpretable outcomes.
  • Qualitative taxonomy of low-confidence cases: Using heuristic rules, we assign each low-confidence sentence to its dominant linguistic pattern where one can be identified. The resulting categories reveal recurring linguistic patterns and provide actionable guidance for future pre-processing and fine-tuning efforts.
Together, these three approaches provide a complementary and interpretable picture of zero-shot domain transfer model behavior without requiring labeled target-language data, a methodological choice we revisit and justify empirically in Section 3.4. In practice, the second component, confidence-based evaluation, operates through a continuous interpretation of the softmax output of the classification head as a proxy for prediction certainty: scores above 0.7 are treated as high-confidence predictions, scores between 0.5 and 0.7 as moderate-confidence predictions, and scores below 0.5 as low-confidence predictions. The low-confidence subset then provides the empirical basis for the third component, the qualitative taxonomy of linguistically complex cases described in Section 3.3. Throughout, confidence is treated as a proxy for model certainty, not as evidence of correctness or calibration.
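The three confidence bands described above can be expressed as a small helper. This is a minimal sketch: the function name is hypothetical, and the handling of the exact boundary values (0.5 and 0.7 are placed in the moderate band) is an assumption, since the text specifies only "above 0.7", "between 0.5 and 0.7", and "below 0.5".

```python
def confidence_band(score: float) -> str:
    """Map a maximum softmax score to the band used in Section 2.3.

    Boundary handling is an assumption: exactly 0.5 and exactly 0.7
    are both treated as "moderate".
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if score > 0.7:
        return "high"
    if score >= 0.5:
        return "moderate"
    return "low"  # candidates for the qualitative taxonomy


for s in (0.85, 0.62, 0.44):
    print(s, confidence_band(s))
```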

2.4. Implementation Details

Minimal preprocessing was applied to preserve the natural linguistic structure of user-generated content, in line with the design of the SentencePiece tokenizer (Google LLC, Mountain View, CA, USA) used by XLM-RoBERTa: only trailing punctuation artifacts were removed, accented characters were Unicode-normalized, and sentence formatting was standardized. No lemmatization, lowercasing, or stopword removal was performed, so the original input form was retained and no information was lost prior to subword tokenization. This deliberate choice avoids domain-specific stopword filtering or colloquial normalization that could inadvertently remove sentiment-bearing tokens, consistent with the exploratory nature of the evaluation.
The zero-shot domain transfer pipeline operates as follows: (1) preprocess the input sentence; (2) tokenize it using the model's SentencePiece tokenizer; (3) run inference to obtain logits for the three classes [negative, neutral, positive]; (4) apply softmax to obtain a confidence distribution; (5) assign the predicted class as the argmax; (6) record confidence as the maximum softmax score; (7) if confidence is below 0.5, flag the sentence for qualitative taxonomy analysis.
The experimental pipeline was implemented in Python 3.10/3.12 using Hugging Face Transformers v4.40.2 and PyTorch v2.3.0 for model inference, SciPy v1.13.0 for numerical operations, Pandas v2.2.2 and NumPy v1.26.4 for data handling, tqdm v4.66.4 for progress tracking, and Matplotlib v3.9.0 for visualization. The experiments were run in Google Colab. The pipeline can be executed in either CPU or GPU runtimes and also supports Apple Silicon (MPS); GPU acceleration reduces inference time but is not required to reproduce the analysis.
The GitHub repository [25] includes: inference_pipeline.py, which implements the zero-shot sentiment classification workflow end-to-end; generate_figures.py, which reproduces the visualization outputs used in the Results section; and a Google Colab notebook (UGSC_ML_inference.ipynb) that runs the complete pipeline (from dataset loading to figure generation) directly in the browser without requiring any local installation. The model downloads automatically from Hugging Face on first run. All code and data are publicly available on Zenodo version 1.1.0 [24] and GitHub release v1.0.0 [25].
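As an illustration of steps (1) and (3) to (7) of the pipeline described in this section, the sketch below works on a vector of three logits using only the Python standard library. It is not the released inference_pipeline.py: the function names are hypothetical, the logits are made-up values standing in for model output, tokenization and model inference (step 2 and the forward pass) are omitted, and the choice of NFC normalization with whitespace collapsing is an assumption about what the light preprocessing involves. Removal of trailing punctuation artifacts is left out because the text does not define it precisely.

```python
import math
import unicodedata

LABELS = ("negative", "neutral", "positive")  # class order given in step (3)
LOW_CONFIDENCE = 0.5  # flagging threshold given in step (7)


def preprocess(sentence: str) -> str:
    """Step 1 (assumed form): NFC-normalize and collapse whitespace only."""
    text = unicodedata.normalize("NFC", sentence)
    return " ".join(text.split())


def softmax(logits):
    """Step 4: numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits):
    """Steps 4-7: softmax, argmax label, max-probability confidence, flag."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    return {
        "label": LABELS[best],
        "confidence": confidence,
        "flag_for_taxonomy": confidence < LOW_CONFIDENCE,
    }


# Made-up logits, standing in for the classifier output on two sentences.
clear_case = decide([0.2, 0.1, 2.3])
ambiguous_case = decide([0.4, 0.5, 0.3])
print(clear_case)
print(ambiguous_case)
```

Because the three probabilities sum to one, the recorded confidence can never fall below one third, so the low-confidence flag captures cases where no single class receives a majority of the probability mass.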

3. Results

3.1. Overall Sentiment Distributions and Model Confidence

Table 1 presents the predicted sentiment distributions and mean confidence scores across all five languages. The overall distribution shows a positive tendency in most languages, while the equal number of instances per language subset allows direct cross-language comparison of predicted class distributions. The most consistent finding, however, is the confidence gap between polarity classes: the model assigned substantially higher confidence to positive and negative predictions (mean: 0.76–0.85) than to neutral ones (mean: 0.58–0.65), a pattern that holds across all five languages and is consistent with H1. As established in Section 2.3, confidence is interpreted throughout as a proxy for model certainty, not as a measure of correctness or calibration. This confidence gap suggests that lower confidence is concentrated precisely on cases that are harder to classify, rather than being randomly distributed across classes.
At the language level, French yielded the highest average confidence across classes (overall mean: 0.815), followed by Italian and Spanish (both 0.784), while English showed the lowest overall confidence (0.730), a pattern that should be interpreted cautiously given the exploratory nature of the evaluation. Spanish showed the highest neutral confidence (0.648), while German showed the highest positive confidence (0.850), with a positive rate of 172/375 (45.9%). All languages maintain the expected confidence gap between polarized and neutral classes, with no language showing an anomalous confidence pattern. Cross-linguistic variation in neutral predictions is discussed in Section 3.2.
Figure 1 provides a visual overview of the overall sentiment class distribution across all languages combined (Negative: 674, 35.95%; Neutral: 295, 15.73%; Positive: 906, 48.32%).

3.2. Language-Specific Sentiment Patterns

Cross-linguistic differences were observed in the distribution of predicted sentiment classes, as shown in Figure 2. Italian showed the highest proportion of negative predictions (152/375, 40.5%), followed by French (147/375, 39.2%) and Spanish (132/375, 35.2%). German showed the lowest negative rate (114/375, 30.4%) and the highest neutral rate (89/375, 23.7%), while French showed the highest positive rate (209/375, 55.7%). English, as the source corpus, showed a balanced distribution with a neutral proportion of 73/375 (19.5%), reflecting the quality-controlled nature of the original annotations. French showed a notably low neutral rate (19/375, 5.1%), suggesting that the model assigns more decisive sentiment labels to French transport reviews. Spanish and German showed the highest neutral confidence scores (0.648 and 0.642, respectively), indicating greater model certainty on neutral predictions in those languages.
Despite these variations, confidence scores for negative predictions remained relatively stable across languages, ranging from 0.760 (English) to 0.809 (French), which is consistent with stable confidence behavior for dissatisfaction-related expressions across linguistic contexts. The language-specific patterns should be interpreted cautiously. Sociolinguistic research [28,29] provides a possible context for interpreting cross-linguistic variation, but the present dataset does not directly demonstrate cultural causes. For example, the higher proportion of negative predictions in Italian may be compatible with broader observations on emotional expressiveness in user-generated content. Cross-lingual differences in neutral proportions are visible, particularly the low neutral rate in French and the higher neutral rate in German; however, these patterns should be treated as preliminary diagnostic findings rather than conclusions confirmed by the data. Together, these findings provide preliminary evidence consistent with H2: cross-linguistic variations are visible in the predicted distribution and should be examined through more systematic linguistic analysis in future work.
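The distributional comparison discussed in this section can be sketched as a per-class difference in percentage points between one language and a reference distribution. The counts below are those reported in the text for French, German, and the pooled predictions; the helper names are hypothetical, and using the pooled distribution as the reference (rather than the English source distribution, whose full counts are not all stated in the text) is a simplifying assumption.

```python
def shares(counts):
    """Convert class counts to proportions."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def deviation_pp(lang_counts, ref_counts):
    """Per-class difference (language minus reference) in percentage points."""
    lang, ref = shares(lang_counts), shares(ref_counts)
    return {label: round(100 * (lang[label] - ref[label]), 1) for label in ref}


# Counts as reported in Sections 3.1 and 3.2.
pooled = {"negative": 674, "neutral": 295, "positive": 906}
french = {"negative": 147, "neutral": 19, "positive": 209}
german = {"negative": 114, "neutral": 89, "positive": 172}

dev_fr = deviation_pp(french, pooled)
dev_de = deviation_pp(german, pooled)
print("French vs pooled:", dev_fr)
print("German vs pooled:", dev_de)
```

Under these assumptions, the largest deviations fall on the neutral class in both languages, with opposite signs, which matches the low French and high German neutral rates described above.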

3.3. Qualitative Taxonomy of Low-Confidence Cases

To characterize the sources of model uncertainty, we focused on predictions with a confidence score below 0.5 that could be assigned to a dominant linguistic pattern, using a manual coding protocol guided by heuristic classification rules. Six distinct categories emerged, spanning a broad spectrum of linguistic complexity: from pragmatic ambiguity, such as irony and idiomatic expressions where surface form and intended meaning diverge, to structural challenges like conditional constructions and mixed-sentiment statements where polarity signals conflict within the same sentence. The coding was performed by two independent annotators; categories were treated as mutually exclusive, and when a sentence exhibited features of multiple patterns, the dominant pattern was assigned by consensus. Table 2 summarizes the six recurrent linguistic patterns identified in the qualitative taxonomy, with representative examples illustrating the types of cases observed in the dataset.
A total of 113 categorized low-confidence predictions (6.0% of all predictions) were assigned to one of these six categories, as reported in Table 3. Inter-annotator agreement calculated over these categorized cases reached κ = 0.664 (substantial agreement), consistent with the inherent subjectivity of qualitative pattern assignment in exploratory diagnostic taxonomies. Translation Drift was the most frequent pattern (34 cases, 30.1%), reflecting the impact of subtle shifts in polarity or intensity between the English source and its translations on model confidence.
Number of cases is reported per category. Percentages are computed over the 113 categorized low-confidence predictions assigned to the six main linguistic categories. Categories are mutually exclusive; inter-annotator agreement calculated over the 113 categorized cases: κ = 0.664.
Figure 3 shows the mean confidence associated with each pattern. After Translation Drift, Mixed Sentiment is the second most frequent category (28 cases, 24.8%); here, conflicting positive and negative signals within the same sentence produce genuinely ambiguous inputs. Idiomatic Expressions constitute the third most frequent category (24 cases, 21.2%), a notable finding that underscores the difficulty of culturally marked non-literal language for zero-shot models across all five languages, not only Spanish and German. Conditional and Hypothetical structures account for 17 cases (15.0%), reflecting the model’s difficulty with speculative or non-assertive sentiment formulations. Irony and Sarcasm, while infrequent in absolute terms (8 cases, 7.1%), receive the lowest average confidence (0.44), consistent with the fundamental difficulty of detecting inverted polarity in a zero-shot domain transfer setting. Informal Punctuation cases (2, 1.8%) are rare but fall below the 0.5 confidence threshold, likely reflecting tokenization noise. Across categories, the confidence range is narrow (0.44–0.47), suggesting that these patterns may correspond to consistently difficult inputs for the model rather than to isolated noise artifacts. Overall, the distribution suggests that cross-lingual transfer effects and pragmatic ambiguity are more frequent contributors to low model confidence than syntactic noise alone. The taxonomy should nonetheless be interpreted as a structured qualitative diagnostic of low-confidence behavior rather than as a fully supervised error analysis. These findings offer concrete guidance for practitioners: pre-processing pipelines targeting translation consistency checks, idiom detection across languages, and conditional phrase flagging could meaningfully improve confidence-based diagnostics in domain-specific multilingual transport NLP.
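The category percentages quoted in this section follow directly from the reported counts. The short sketch below recomputes them as a consistency check; the variable names are hypothetical, and the counts are the ones stated in the text.

```python
TOTAL_PREDICTIONS = 1875  # 375 sentences x 5 languages

# Category counts as reported in Section 3.3.
taxonomy_counts = {
    "Translation Drift": 34,
    "Mixed Sentiment": 28,
    "Idiomatic Expressions": 24,
    "Conditional and Hypothetical": 17,
    "Irony and Sarcasm": 8,
    "Informal Punctuation": 2,
}

total_categorized = sum(taxonomy_counts.values())

# Share of each category among the categorized low-confidence cases.
category_share = {
    name: round(100 * n / total_categorized, 1)
    for name, n in taxonomy_counts.items()
}

# Share of categorized cases among all predictions.
overall_share = round(100 * total_categorized / TOTAL_PREDICTIONS, 1)

for name, pct in category_share.items():
    print(f"{name}: {pct}%")
print(f"categorized cases: {total_categorized} ({overall_share}% of all predictions)")
```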

3.4. Limitations of Quantitative Evaluation

Because the translated datasets do not include gold-standard sentiment annotations (an inherent constraint of the zero-shot cross-lingual scenario described in Section 2.3, where labeled target-language data are unavailable by design), traditional supervised metrics (accuracy, F1-score, precision, recall) cannot be computed for non-English texts. This is a deliberate methodological trade-off that motivates the evaluation strategy adopted throughout this section.
It is important to note that the English subset of UGSC-ML carries manually validated binary sentiment labels inherited from the original UGSC [15], providing a source-language reference point for interpreting zero-shot model behavior. These labels, drawn from TripAdvisor reviews rated 1 star (negative) or 5 stars (positive) and then manually validated, provide a stronger reference point than distant supervision alone: the English evaluation is grounded in expert-annotated labels from a prior peer-reviewed study, and every sentence carries a star-rating-implied binary label that is consistent with the gold-standard annotation. The zero-shot model, however, predicts 73 of 375 English sentences (19.5%) as neutral, a class absent from the original binary annotation scheme. This rate drops sharply among high-confidence predictions: at a confidence threshold of 0.8, only 3 of 151 predictions (2.0%) are neutral, with the remaining 98.0% classified as positive or negative (mean confidence: 0.90 for negative and 0.86 for positive). This pattern suggests that neutral predictions are concentrated in lower-confidence and potentially ambiguous cases, rather than necessarily indicating systematic misclassification. It also explains why direct comparison with the supervised F1 of 97.78% reported in [15] for the fine-tuned binary classifier on the same domain is not straightforward: the evaluation tasks differ in granularity (binary vs. three-class), training regime (fine-tuned vs. zero-shot), and the treatment of ambiguous cases. These differences highlight the complementary nature of the two approaches rather than a contradiction in their results and are discussed further in the context of the broader methodological contributions in Section 4.
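The thresholded analysis described above amounts to filtering predictions by confidence and counting how many of the retained ones are neutral. The sketch below shows one way to do this; the function name and the toy predictions are hypothetical and are not drawn from UGSC-ML, and treating the threshold as inclusive is an assumption. The last lines simply recompute the 3 of 151 proportion reported in the text.

```python
def neutral_above(predictions, threshold):
    """Return (n_neutral, n_kept) among predictions with confidence >= threshold.

    `predictions` is an iterable of (label, confidence) pairs. The inclusive
    comparison is an assumption; the text does not state the boundary rule.
    """
    kept = [label for label, conf in predictions if conf >= threshold]
    return kept.count("neutral"), len(kept)


# Hypothetical toy predictions, NOT real model output.
toy = [
    ("positive", 0.91),
    ("negative", 0.88),
    ("neutral", 0.82),
    ("neutral", 0.55),
    ("positive", 0.62),
    ("negative", 0.95),
]

n_neutral, n_kept = neutral_above(toy, 0.8)
print(f"toy data: {n_neutral} of {n_kept} high-confidence predictions are neutral")

# Recomputing the proportion reported for the English subset.
reported_share = round(100 * 3 / 151, 1)
print(f"reported: 3 of 151 = {reported_share}%")
```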

4. Discussion

4.1. Methodological Contributions

The results presented in Section 3 indicate that zero-shot domain transfer for multilingual sentiment analysis is both technically feasible and methodologically tractable in a domain-specific, annotation-scarce setting. Three aspects of this finding merit particular discussion, each addressing a distinct dimension of the contribution.
First, the cross-lingual consistency of confidence patterns (high for polarized classes and lower for neutral inputs across all five languages) suggests that XLM-RoBERTa’s multilingual representations can produce stable confidence patterns under domain shift without task-specific adaptation, a conclusion subject to future supervised validation on translated data. This cross-lingual stability is theoretically grounded in the architecture of XLM-RoBERTa [16]: pre-training on the CC100 corpus across 100 languages using a shared SentencePiece vocabulary encourages the model to develop language-agnostic subword representations, in which semantically similar content across languages occupies overlapping regions of the representation space. This shared representational geometry is what enables sentiment-relevant features learned from Twitter data in one language to transfer, at least partially, to transport reviews in other languages without explicit cross-lingual supervision, a property documented in the cross-lingual transfer literature [16,17,18] and consistent with our observed confidence patterns. It also suggests that zero-shot domain transfer pipelines can serve as a viable first-pass diagnostic solution in domains where labeled multilingual data are difficult or costly to obtain.
Second, the UGSC-ML dataset fills a concrete gap in the NLP evaluation landscape. Domain-specific, quality-controlled multilingual benchmarks remain scarce, and a publicly released dataset covering five languages, together with three-class model predictions, provides a reusable resource for evaluating future multilingual models in applied settings.
Third, the confidence-based evaluation strategy addresses a structural challenge in zero-shot NLP: how to diagnose model certainty patterns without ground-truth labels in target languages. By treating confidence distributions as a diagnostic signal rather than a performance proxy, this approach enables interpretable model assessment before deployment, an increasingly important consideration as NLP systems are integrated into operational contexts with real consequences.
The empirical results presented in Section 3 are consistent with all three contributions and with both hypotheses. H1 is consistent with the high confidence observed for polarized content across all five languages (mean 0.76–0.85 for negative/positive vs. 0.58–0.65 for neutral), while H2 is reflected in visible but preliminary cross-linguistic variations, notably Italian’s higher negative proportion (152/375, 40.5%), French’s low neutral rate (19/375, 5.1%), and German’s positive confidence, the highest of any language (0.850). The qualitative taxonomy further shows that cross-lingual transfer effects (Translation Drift, 30.1%) and pragmatic ambiguity patterns (Mixed Sentiment, 24.8%; Idiomatic Expressions, 21.2%) together account for more than three-quarters (76.1%) of categorized low-confidence cases, pointing to specific linguistic phenomena that future pre-processing and fine-tuning strategies should target. These language-specific patterns and taxonomy findings should be interpreted as preliminary diagnostic results requiring further linguistic validation, not as demonstrated cultural or causal explanations.
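To make the notion of a confidence score concrete, the sketch below derives a label and a confidence from raw class scores. It assumes that confidence is the maximum softmax probability over the three classes, which is a common convention but is not confirmed by this section. The label order, the function names, and the 0.5 review cutoff are likewise hypothetical and should be checked against the released code and the model configuration.

```python
import math
from typing import List, Sequence, Tuple

# Assumed label order for the three-class head; verify against the model config.
LABELS = ("negative", "neutral", "positive")


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_confidence(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the top label and its probability, used here as the confidence."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


def is_low_confidence(confidence: float, cutoff: float = 0.5) -> bool:
    """Flag a prediction for manual review; the 0.5 cutoff is hypothetical."""
    return confidence < cutoff


# Hypothetical logits for two sentences (not real model outputs).
clear_label, clear_conf = predict_with_confidence([2.0, 0.0, -1.0])
mixed_label, mixed_conf = predict_with_confidence([0.3, 0.2, 0.1])
```

Under this reading, a sentence with nearly flat class scores receives a confidence close to one third and would be routed to qualitative review, while a clearly polarized sentence would not.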

4.2. Generalizability of the Framework and Domain Suitability

Sustainable urban mobility constitutes a particularly demanding testbed for zero-shot multilingual NLP, combining three characteristics that frequently co-occur in real-world AI deployment scenarios: (1) linguistic diversity, as urban transport systems serve heterogeneous populations whose feedback spans multiple languages; (2) domain specificity, as transport reviews contain technical vocabulary, service-specific idioms, and evaluative structures that diverge from the informal social media text on which most sentiment models are fine-tuned; and (3) annotation scarcity, as the cost of producing gold-standard multilingual labeled data in specialized domains is prohibitive at scale. The framework proposed here directly addresses this combination, showing that a pre-trained multilingual transformer can provide interpretable confidence-scored predictions under these constraints without any task- or language-specific adaptation, a finding with practical relevance for low-resource multilingual deployment.
Crucially, this three-characteristic profile is not unique to mobility. It recurs across applied AI domains, including healthcare feedback analysis, legal document review, financial opinion mining, and public-sector service evaluation, all contexts where labeled multilingual data are sparse, domain vocabulary is specialized, and annotation efforts are not feasible at deployment scale. In all these settings, the evaluation protocol and qualitative taxonomy introduced in this work may be adaptable—a hypothesis that the present study supports at the methodological level, though domain-specific empirical validation in each target setting would be required before deployment. The confidence-based assessment strategy may serve as a model-agnostic diagnostic tool in zero-shot domain transfer NLP deployments where supervised validation is unavailable, and the six-category linguistic taxonomy provides a structured basis for designing targeted pre-processing pipelines across languages and domains. The UGSC-ML dataset and code, publicly released alongside this paper, are designed to remain a reusable evaluation resource as future multilingual models are developed and compared in domain-specific settings.

4.3. Limitations and Future Directions

Several limitations must be acknowledged. The most important limitation is the absence of gold-standard annotations in target languages, which prevents supervised evaluation using accuracy, precision, recall, F1-score, and calibration metrics. Consequently, confidence scores in this study should be interpreted as diagnostic indicators of model certainty, not as evidence of correctness or validity. The domain mismatch between Twitter-based fine-tuning and transport reviews also affects model behavior on complex or balanced expressions, and translation drift, despite rigorous manual review, may subtly alter sentiment polarity in some instances. Finally, the dataset size, while adequate for exploratory purposes, remains modest by current NLP standards.
These limitations point to concrete directions for future work. The most impactful near-term step would be the creation of a manually annotated multilingual gold-standard subset, for example, 50–100 independently annotated sentences per target language, enabling direct computation of accuracy, precision, recall, F1-score, inter-annotator agreement, and calibration diagnostics. A second priority would be a controlled comparison across multilingual architectures, including mBERT, XLM, multilingual E5-based classifiers, and more recent multilingual encoders. Domain-adaptive fine-tuning using transport-specific multilingual data would then directly address both the domain mismatch and the evaluation gap identified in Section 3.4. A natural extension is Aspect-Based Sentiment Analysis (ABSA), which would enable fine-grained diagnostics of specific service attributes (punctuality, comfort, accessibility, cost) rather than document-level polarity. Longer-term, integrating geospatial and temporal metadata could support real-time spatially resolved sentiment monitoring, while crowd-sourced multilingual annotation would enable fully supervised cross-lingual evaluation and direct comparison across model architectures. Taken together, these directions outline a research agenda toward more robust, fine-grained, and fully supervised multilingual sentiment analysis in domain-specific deployment contexts.
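Once a gold-standard subset exists, the supervised diagnostics listed above could be computed along the following lines. This is a dependency-free sketch under stated assumptions: the function names, the ten equal-width calibration bins, and the six toy sentences are illustrative and do not come from the study.

```python
from collections import Counter
from typing import List, Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of sentences whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    scores: List[float] = []
    for lab in sorted(set(gold) | set(pred)):
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean gap between accuracy and confidence over equal-width bins."""
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            acc = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / len(confidences) * abs(acc - avg_conf)
    return ece


# Hypothetical gold labels and predictions for six sentences.
gold = ["pos", "neg", "neu", "pos", "neg", "pos"]
pred = ["pos", "neg", "pos", "pos", "neu", "pos"]
```

The same `cohen_kappa` function applies both to model-versus-gold agreement and to agreement between two human annotators, which is how a coefficient such as κ is typically obtained for a coding scheme.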

5. Conclusions

This study indicates that zero-shot domain transfer for multilingual sentiment analysis, applied through XLM-RoBERTa, is a technically viable and methodologically tractable approach for the domain-specific, annotation-scarce conditions characteristic of sustainable urban mobility. Using a quality-controlled English corpus with sentence-aligned translations into Spanish, French, German, and Italian, we evaluated cross-lingual behavior across five languages without any task-specific adaptation, finding consistent and interpretable confidence patterns: high confidence for polarized sentiment (mean 0.76–0.85), lower confidence for neutral or ambiguous cases (mean 0.58–0.65), and visible cross-linguistic variations that require further linguistic validation.
Beyond the empirical results, this work introduces three reusable contributions for the AI research community: (1) a publicly released multilingual evaluation resource (UGSC-ML); (2) a confidence-based evaluation protocol designed for zero-shot NLP deployment settings where ground-truth labels are unavailable; and (3) a qualitative taxonomy of six linguistic patterns, namely translation drift (30.1%), mixed sentiment (24.8%), idiomatic expressions (21.2%), conditional structures (15.0%), irony and sarcasm (7.1%), and informal punctuation (1.8%). The taxonomy provides actionable guidance for pre-processing and fine-tuning in domain-specific multilingual contexts, and the inter-annotator agreement of κ = 0.664 indicates substantial reliability of the coding scheme. Together, these contributions are designed to support reproducible and interpretable multilingual NLP beyond the specific domain studied here.
More broadly, the framework developed here may be adaptable to applied AI settings combining linguistic diversity, domain specificity, and annotation scarcity—conditions that recur across healthcare, legal, financial, and public-sector NLP. Domain-specific empirical validation would be required before deployment in each new setting. At the same time, the ability to diagnose model certainty patterns without supervised validation is becoming increasingly important as large multilingual models are integrated into operational systems with real-world consequences. The methodological tools introduced in this paper offer a principled step in that direction and a foundation for future work on robust, interpretable multilingual NLP in resource-constrained settings.

Author Contributions

Conceptualization, A.S. and J.K.G.; methodology, A.S. and J.K.G.; software, A.S.; validation, A.S., J.K.G. and J.d.O.; formal analysis, A.S.; investigation, A.S. and J.d.O.; resources, A.S. and J.d.O.; data curation, A.S.; writing—original draft preparation, A.S.; writing—review and editing, A.S. and J.K.G.; visualization, A.S.; supervision, J.K.G. and J.d.O.; project administration, J.d.O.; funding acquisition, J.d.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research is part of the Research Project PARTICIPA-IA (AIA2025-163553-C42) funded by MICIU/AEI/10.13039/501100011033/ (Spanish Ministry of Science, Innovation and Universities/Agencia Estatal de Investigación (AEI), grant number AIA2025-163553-C42). The APC was funded by the AEI through the same grant.

Institutional Review Board Statement

Not applicable. This study used publicly available UGC and did not involve human subjects. No IRB approval was required.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original English UGSC dataset is publicly available on GitHub [26]. The UGSC-ML multilingual sentiment dataset, model prediction outputs, and low-confidence annotation files are publicly available on Zenodo [24]. The associated inference scripts, figure-generation code, requirements file, and Google Colab notebook are publicly available in the GitHub repository [25].

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
NLP: Natural Language Processing
UGC: User-Generated Content
UGSC: User Gold Standard Corpus
UGSC-ML: User Gold Standard Corpus Multilingual
CLSA: Cross-Lingual Sentiment Analysis
XLM-RoBERTa: Cross-lingual Language Model Robustly Optimized BERT Pretraining Approach
mBERT: Multilingual Bidirectional Encoder Representations from Transformers
XLM: Cross-lingual Language Model Pretraining
ABSA: Aspect-Based Sentiment Analysis
GDPR: General Data Protection Regulation
SDGs: Sustainable Development Goals

References

  1. Gudmundsson, H.; Marsden, G.; Zietsman, J. Sustainable Transportation: Indicators, Frameworks, and Performance Management; Springer: Cham, Switzerland, 2016.
  2. European Commission. Sustainable and Smart Mobility Strategy—Putting European Transport on Track for the Future. 2020. Available online: https://transport.ec.europa.eu (accessed on 17 September 2025).
  3. United Nations. Transforming Our World: The 2030 Agenda for Sustainable Development. 2015. Available online: https://sdgs.un.org/goals (accessed on 26 April 2025).
  4. Gobierno Vasco. Estrategia de Movilidad Sostenible de Euskadi 2030. 2025. Available online: https://www.euskadi.eus/plan-director-del-transporte-sostenible/web01-a2kudeak/es/ (accessed on 29 April 2025).
  5. Castillo, H.; Pitfield, D.E. ELASTIC—A Methodological Framework for Identifying and Selecting Sustainable Transport Indicators. Transp. Res. Part D Transp. Environ. 2010, 15, 179–188.
  6. Gitto, S.; Mancuso, P. Brand perceptions of airports using social networks. J. Air Transp. Manag. 2019, 75, 153–163.
  7. Grant-Muller, S.M.; Gal-Tzur, A.; Minkov, E.; Nocera, S.; Kuflik, T.; Shoor, I. Enhancing transport data collection through social media sources: Methods, challenges and opportunities for textual data. IET Intell. Transp. Syst. 2015, 9, 407–417.
  8. Joshi, P.; Santy, S.; Budhiraja, A.; Bali, K.; Choudhury, M. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6282–6293.
  9. Liu, B. Sentiment Analysis and Opinion Mining; Morgan & Claypool Publishers: San Rafael, CA, USA, 2012; Volume 5.
  10. Cambria, E.; Schuller, B.; Xia, Y.; Havasi, C. New Avenues in Opinion Mining and Sentiment Analysis. IEEE Intell. Syst. 2013, 28, 15–21.
  11. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
  12. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
  13. Collins, C.; Hasan, S.; Ukkusuri, S.V. A novel transit rider satisfaction metric: Rider sentiments measured from online social media data. J. Public Transp. 2013, 16, 21–45.
  14. Hadiuzzaman, M.; Das, T.; Hasnat, M.M.; Hossain, S.; Musabbir, S.R. Structural equation modeling of user satisfaction of bus transit service quality based on stated preferences and latent variables. Transp. Plan. Technol. 2017, 40, 257–277.
  15. Serna, A.; Soroa, A.; Agerri, R. Applying Deep Learning Techniques for Sentiment Analysis to Assess Sustainable Transport. Sustainability 2021, 13, 2397.
  16. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzman, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-Lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), Online, 5–10 July 2020; pp. 8440–8451.
  17. Conneau, A.; Lample, G. Cross-Lingual Language Model Pretraining. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Curran Associates Inc.: Red Hook, NY, USA, 2019; pp. 7057–7067.
  18. Barbieri, F.; Espinosa Anke, L.; Camacho-Collados, J. TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 1644–1650.
  19. Barnes, J.; Klinger, R.; Schulte im Walde, S. Projecting Embeddings for Domain Adaptation: Joint Modeling of Sentiment Analysis in Diverse Domains. In Proceedings of the 27th International Conference on Computational Linguistics (COLING), Santa Fe, NM, USA, 20–26 August 2018; pp. 818–829.
  20. Hu, J.; Ruder, S.; Siddhant, A.; Neubig, G.; Firat, O.; Johnson, M. XTREME: A Massively Multilingual Multi-Task Benchmark for Evaluating Cross-Lingual Generalization. In Proceedings of the 37th International Conference on Machine Learning (ICML), Virtual Event, 13–18 July 2020; pp. 4411–4421.
  21. FitzGerald, J.; Hench, C.; Peris, C.; Mackie, S.; Rottmann, K.; Sanchez, A.; Natarajan, P. MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL), Toronto, ON, Canada, 9–14 July 2023; pp. 4277–4302.
  22. Fan, A.; Bhosale, S.; Schwenk, H.; Ma, Z.; El-Kishky, A.; Goyal, S.; Joulin, A. Beyond English-Centric Multilingual Machine Translation. J. Mach. Learn. Res. 2021, 22, 4839–4886.
  23. NLLB Team. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv 2022, arXiv:2207.04672.
  24. Serna, A. UGSC Multilingual Sentiment Dataset for Sustainable Mobility. Zenodo. 2026. Available online: https://zenodo.org/records/15085521 (accessed on 10 April 2026).
  25. Serna, A.; Gerrikagoitia, J.K. UGSC Multilingual Sentiment: Code and Resources. 2026. Available online: https://github.com/ainhoaserna/UGSC-multilingual-sentiment (accessed on 10 April 2026).
  26. Agerri, R.; Soroa, A.; Serna, A. Sustainable Transport Sentiment Corpus. 2021. Available online: https://github.com/ixa-ehu/sustainable-transport-sentiment-corpus (accessed on 21 September 2025).
  27. Barbieri, F.; Espinosa Anke, L.; Camacho-Collados, J. CardiffNLP Twitter XLM-RoBERTa Base Sentiment Model. Hugging Face 2022. Available online: https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment (accessed on 29 April 2025).
  28. Hall, E.T. Beyond Culture; Anchor Books: New York, NY, USA, 1976.
  29. Hofstede, G. Culture’s Consequences: Comparing Values, Behaviors, Institutions and Organizations Across Nations, 2nd ed.; Sage: Thousand Oaks, CA, USA, 2001.
Figure 1. Overall sentiment class distribution (n = 1875 predictions).
Figure 2. Predicted sentiment class distribution per language (n = 375 per language).
Figure 3. Distribution and mean confidence of low-confidence cases by linguistic pattern (n = 113).
Table 1. Predicted sentiment distributions and mean confidence by language.
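| Language | Sentiment | Prediction Count | Mean Confidence | Hypothesis Support |
|---|---|---|---|---|
| English | Negative | 129 | 0.760 | H1 ✓ |
| English | Neutral | 73 | 0.580 | H1 ✓ |
| English | Positive | 173 | 0.770 | H1 ✓ |
| French | Negative | 147 | 0.809 | H1 ✓ |
| French | Neutral | 19 | 0.576 | H1 ✓ |
| French | Positive | 209 | 0.841 | H1 ✓ |
| German | Negative | 114 | 0.762 | H1 ✓ |
| German | Neutral | 89 | 0.642 | H1 ✓ |
| German | Positive | 172 | 0.850 | H1 ✓ |
| Italian | Negative | 152 | 0.785 | H2—higher neg |
| Italian | Neutral | 43 | 0.587 | H1 ✓ |
| Italian | Positive | 180 | 0.831 | H1 ✓ |
| Spanish | Negative | 132 | 0.775 | H1 ✓ |
| Spanish | Neutral | 71 | 0.648 | H1 ✓ |
| Spanish | Positive | 172 | 0.848 | H1 ✓ |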
Note: The symbol “✓” indicates that the observed confidence pattern is consistent with the corresponding hypothesis. “H1 ✓” denotes support for H1, namely that the model assigns higher mean confidence to polarized sentiment classes than to neutral predictions within the same language. “H2—higher neg” indicates the language-specific pattern supporting H2, namely the higher proportion of negative predictions observed for Italian.
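The H1 check summarized in Table 1 can be reproduced directly from the tabulated values. The sketch below transcribes the counts and mean confidences from the table. The helper names `h1_holds` and `class_share` are illustrative, and H1 is read here as "both polarized classes have higher mean confidence than the neutral class within the same language", following the table note.

```python
# Values transcribed from Table 1: language -> class -> (count, mean confidence).
TABLE_1 = {
    "English": {"negative": (129, 0.760), "neutral": (73, 0.580), "positive": (173, 0.770)},
    "French": {"negative": (147, 0.809), "neutral": (19, 0.576), "positive": (209, 0.841)},
    "German": {"negative": (114, 0.762), "neutral": (89, 0.642), "positive": (172, 0.850)},
    "Italian": {"negative": (152, 0.785), "neutral": (43, 0.587), "positive": (180, 0.831)},
    "Spanish": {"negative": (132, 0.775), "neutral": (71, 0.648), "positive": (172, 0.848)},
}


def h1_holds(rows: dict) -> bool:
    """True when both polarized classes exceed the neutral mean confidence."""
    return min(rows["negative"][1], rows["positive"][1]) > rows["neutral"][1]


def class_share(rows: dict, label: str) -> float:
    """Percentage of a language's predictions that fall in `label`."""
    total = sum(count for count, _ in rows.values())
    return round(100 * rows[label][0] / total, 1)


h1_by_language = {lang: h1_holds(rows) for lang, rows in TABLE_1.items()}
totals = {lang: sum(c for c, _ in rows.values()) for lang, rows in TABLE_1.items()}
italian_negative = class_share(TABLE_1["Italian"], "negative")
french_neutral = class_share(TABLE_1["French"], "neutral")
```

Every language sums to 375 predictions, and the two shares recompute to the 40.5% and 5.1% quoted in the discussion.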
Table 2. Taxonomy of low-confidence linguistic patterns.
| Pattern | Example | Observed Issue | Effect on Model |
|---|---|---|---|
| Mixed Sentiment | “Cheap but always late.” | Conflicting polarity signals | Low confidence; unstable label |
| Conditional/Hypothetical | “It would be great if buses arrived on time.” | Speculative, non-assertive sentiment | Tends to neutral; low certainty |
| Idiomatic (ES) | “Me dejó frío.” | Cultural idiom for disappointment | Misinterpreted as neutral/positive |
| Idiomatic (DE) | “Nicht der Rede wert.” | Implicit dissatisfaction masked by idiom | Misclassified or uncertain |
| Irony/Sarcasm | “Just what I needed—another delayed train.” | Surface polarity inverts actual sentiment | Wrong polarity; low confidence |
| Informal Punctuation | “Always late,,,, again.” | Repeated punctuation introduces noise | Low token-level alignment |
| Translation Drift (FR) | EN: “The ride was okay.” → FR: “Le trajet était agréable.” | Slight positive shift in translation | Polarity misalignment |
Table 3. Low-confidence case distribution by linguistic pattern (n = 113; κ  =  0.664).
| Pattern | N Cases | % of Categorized Low-Conf. | Mean Confidence |
|---|---|---|---|
| Irony/Sarcasm | 8 | 7.1% | 0.456 |
| Idiomatic Expressions (ES/DE) | 24 | 21.2% | 0.447 |
| Conditional/Hypothetical | 17 | 15.0% | 0.458 |
| Mixed Sentiment | 28 | 24.8% | 0.460 |
| Informal Punctuation | 2 | 1.8% | 0.469 |
| Translation Drift | 34 | 30.1% | 0.454 |
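The percentage column of Table 3 follows from the case counts, and the arithmetic can be checked as below. The counts are transcribed from the table, while the variable names are illustrative.

```python
# Case counts transcribed from Table 3 (categorized low-confidence predictions).
TABLE_3_COUNTS = {
    "Irony/Sarcasm": 8,
    "Idiomatic Expressions (ES/DE)": 24,
    "Conditional/Hypothetical": 17,
    "Mixed Sentiment": 28,
    "Informal Punctuation": 2,
    "Translation Drift": 34,
}

total_cases = sum(TABLE_3_COUNTS.values())

# Share of each pattern among all categorized cases, in percent.
shares = {
    pattern: round(100 * count / total_cases, 1)
    for pattern, count in TABLE_3_COUNTS.items()
}

# The three most frequent patterns and their combined share.
top_three = sorted(TABLE_3_COUNTS, key=TABLE_3_COUNTS.get, reverse=True)[:3]
top_three_share = round(
    100 * sum(TABLE_3_COUNTS[p] for p in top_three) / total_cases, 1
)
```

The three leading patterns (translation drift, mixed sentiment, and idiomatic expressions) cover 86 of the 113 cases, which is the basis for the "more than three-quarters" statement in Section 4.1.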
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Serna, A.; Gerrikagoitia, J.K.; de Oña, J. Cross-Lingual Sentiment Classification in Sustainable Mobility: A Zero-Shot Domain Transfer Evaluation Framework. AI 2026, 7, 216. https://doi.org/10.3390/ai7060216
