1. Introduction
In machine learning, large, high-quality datasets are crucial for training effective models. Low-quality datasets, meaning those that are small or have an imbalanced distribution of classes, can harm a model's performance and reliability by increasing the risk of poor generalization and of unwanted bias toward specific classes. However, obtaining a sizable, high-quality dataset, or a balanced dataset for classification, is challenging because data collection is a costly and time-consuming process. Data augmentation has been proposed to address these challenges.
In data augmentation, the training data is artificially enlarged [1] or, alternatively, the proportion of a minority class in an imbalanced dataset is increased [2]. However, the data must be inflated with reasonable modifications. For instance, if an image recognition model were trained on a dataset that only includes images of humans facing right, the model would only recognize humans facing right. The dataset can instead be augmented by cropping and rotating the existing images to produce humans facing different directions, and training on these new images allows the model to recognize images of humans facing many directions; this is a reasonable way to inflate the dataset. For text-based datasets, data can be augmented using several methods, for example, changing the order of the words in texts, replacing words with synonyms, translating texts, or summarizing texts.
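As a minimal sketch of one such text-based method, the snippet below performs synonym replacement. The synonym table and replacement probability are toy stand-ins for a real lexical resource such as WordNet, not part of the original study.

import random

# Toy synonym table; a real system would draw on a lexical resource
# such as WordNet. Entries here are illustrative only.
SYNONYMS = {
    "good": ["great", "fine"],
    "fast": ["quick", "speedy"],
}

def synonym_replace(text: str, p: float = 0.3) -> str:
    """Return a copy of text where known words are swapped for a random
    synonym with probability p."""
    return " ".join(
        random.choice(SYNONYMS[w]) if w in SYNONYMS and random.random() < p else w
        for w in text.split()
    )

print(synonym_replace("shipping was fast and the product is good"))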
The importance of a large training dataset cannot be overstated in machine learning. A larger, higher-quality training dataset can lead to a more reliable model that makes more accurate predictions or classifications. Conversely, a small training dataset makes it difficult to train a reliable model [3], because small samples are insufficient to represent all possible data values in regression or classification tasks; a decision made by a model trained on a smaller dataset therefore carries significant uncertainty [4]. Moreover, a small dataset increases the risk of overfitting, which in turn leads to poor generalization: an overfit model shows high accuracy on the training data but low accuracy on test or new data [5]. It is therefore crucial to use large training datasets to develop reliable and accurate machine learning models.
In a classification task, a balanced dataset, in which each class has a roughly equal amount of data, is crucial because it yields a model that is not biased toward any particular class. However, obtaining a balanced dataset can be challenging, since samples for specific classes are hard to find, which results in imbalanced datasets. Imbalanced datasets pose a challenge for classification algorithms: minority classes tend to incur higher misclassification costs than majority classes [6], which means that models trained on imbalanced datasets frequently favor the majority classes at the expense of the minority classes [7]. This results in a lower-performing model.
In sentiment analysis, finding high-quality datasets with a large and balanced distribution of sentiment classes is a challenging but crucial requirement for accurate and unbiased analysis. For example, in the business market, customer sentiment can significantly affect other customers' intentions [8]; understanding customers' opinions is therefore essential for developing the market. With a small or unbalanced dataset, however, it is difficult to understand the customer due to the lack of information. As a result, a dataset used to train a sentiment analysis model must be of high quality [9].
In this study, the effect of data augmentation on the performance of sentiment analysis models is investigated for three distinct language datasets: French, German, and Japanese. Previous work on data augmentation [2,10,11,12] has focused primarily on sentiment analysis within single languages, mostly English, and has overlooked sentiment analysis across several languages. Sentiment analysis across various languages deserves attention, as each language has its own structures, alphabet, and grammar. Additionally, text-based datasets are relatively easy to find in English, one of the most widely used languages, whereas datasets in less commonly used languages can be difficult to obtain. Data augmentation in various languages is therefore crucial to addressing this issue.
Translation-based augmentation was employed to expand the original French, German, and Japanese datasets by translating them into English through various intermediate languages, aiming to increase linguistic diversity. While Abonizio et al. [13] evaluated several text augmentation techniques, including back-translation, our study builds on this work by systematically comparing Google Translate and DeepL for multilingual sentiment analysis across the French, German, and Japanese datasets.
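A minimal sketch of this pivot (intermediate-language) translation idea follows. Here, translate is a hypothetical placeholder for a call to a machine translation service such as Google Translate or DeepL (both require an API client and credentials), and the language codes are illustrative.

def translate(text: str, source: str, target: str) -> str:
    """Hypothetical stand-in for an MT service call (e.g., Google Translate
    or DeepL); a real implementation would use the service's API client."""
    raise NotImplementedError

def pivot_augment(text: str, source: str, pivots: list[str], target: str = "en") -> list[str]:
    """Produce augmented variants by routing text through each intermediate
    (pivot) language before translating into the final target language."""
    variants = []
    for pivot in pivots:
        pivoted = translate(text, source=source, target=pivot)
        variants.append(translate(pivoted, source=pivot, target=target))
    return variants

# e.g., expand a French review into English via Spanish and German pivots:
# pivot_augment("Ce produit est excellent.", source="fr", pivots=["es", "de"])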
To assess generalization, models trained on one language were tested on datasets from other languages. This cross-lingual testing evaluates whether learned patterns can transfer across linguistic boundaries, which is especially helpful when labeled data is limited. Real-world uses include multilingual sentiment analysis in global business and market research.
This study aims to answer the following research questions:
RQ1: Does translation-based data augmentation improve sentiment classification in low-resource languages (French, German, and Japanese)?
RQ2: Can sentiment models trained on augmented data generalize across languages?
We propose the following hypotheses:
Hypothesis 1. Translation-based augmentation will enhance sentiment classification performance in all three languages.
Hypothesis 2. Models trained on augmented data will generalize across languages, but the level of improvement will vary due to linguistic and semantic differences.
Key contributions of this study include the following:
A systematic comparison of Google Translate and DeepL for translation-based augmentation in three typologically diverse languages;
A quantitative evaluation using Support Vector Machine (SVM) classifiers to measure classification performance and cross-lingual generalization;
Insights into how language structure and translation quality influence augmentation effectiveness.
The remainder of this paper is organized as follows: Section 2 reviews related work, Section 3 describes the methodology and translation-based augmentation process, Section 4 presents experimental results and analysis, Section 5 concludes the study, and Section 6 outlines limitations and future research directions.
3. Methodology
The methodology progresses from dataset preparation to the analysis and generalization of the models, as shown in Figure 1. After dataset preparation, the datasets were subjected to a preprocessing step including data cleaning, feature selection, and stopword removal. For compatibility reasons, the English stopword list from scikit-learn was applied to all datasets, including German and Japanese. Although this may not remove all non-informative words in non-English datasets, preliminary testing showed minimal performance differences when stopword removal was omitted entirely. Following this process, the sentiment analysis model was trained using the SVM machine learning algorithm, and its performance was evaluated using standard evaluation metrics. Lastly, the model's generalization capabilities were assessed.
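A minimal sketch of this pipeline follows, assuming scikit-learn and toy data in place of the actual review sets; all variable contents are illustrative. It trains a linear SVM on bag-of-words features, evaluates it in-language, and then runs the cross-lingual generalization check described above.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the real review datasets (labels: 1 = positive, 0 = negative).
train_texts = ["j'adore ce produit", "tres mauvais achat", "vraiment excellent", "je deteste ca"]
train_labels = [1, 0, 1, 0]
test_texts_same_lang = ["produit excellent", "mauvais produit"]
test_labels_same_lang = [1, 0]
test_texts_other_lang = ["sehr gutes produkt", "schlechter kauf"]  # cross-lingual check
test_labels_other_lang = [1, 0]

# Bag-of-words features plus a linear SVM, mirroring the flow in Figure 1.
model = make_pipeline(CountVectorizer(stop_words="english"), LinearSVC())
model.fit(train_texts, train_labels)

# In-language evaluation.
pred = model.predict(test_texts_same_lang)
print(accuracy_score(test_labels_same_lang, pred), f1_score(test_labels_same_lang, pred))

# Generalization: apply the trained model to another language's test set.
print(accuracy_score(test_labels_other_lang, model.predict(test_texts_other_lang)))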
3.1. Description of Datasets
In this study, online shopping review datasets were chosen to implement sentiment analysis models with data augmentation techniques. The datasets were collected from the Multilingual Amazon Reviews Corpus, Amazon AWS. They include Amazon product reviews from the USA, Japan, Germany, France, Spain, and China, written in English, Japanese, German, French, Spanish, and Chinese between 1 November 2015 and 1 November 2019. Each dataset includes several features: reviewer ID, review text, review title, rating (on a scale of 1 to 5), product ID, and product category. For this study, reviews in the 'beauty' product category were selected. It is essential to choose a specific product category, as customer reviews vary depending on the product type. For instance, in the technology category, a review might be "This is very fast and easy to set up", whereas in the shoe category, a review might be "They are very comfortable when worn". Focusing on a specific product category thus allows us to capture the distinctive vocabulary and expressions of its customer reviews.
Several language datasets, namely French, German, and Japanese, were used to implement the sentiment analysis model with data augmentation techniques. These languages were chosen for their different language systems. French belongs to the Romance language family and descends directly from Latin. German belongs to the Germanic language family, which also includes English [6]. Japanese, by contrast, is a language isolate that does not belong to any established language family [6]. Japanese uses three writing scripts, Kanji, Hiragana, and Katakana, while German and French use the Latin alphabet like English [6]. The three languages also use different word orders: Japanese uses the subject–object–verb order, whereas German and French use the subject–verb–object order, like English [6]. Furthermore, these languages differ in other aspects, such as culture, nouns, and grammar.
All reviews were rated on a scale of 1 to 5. To examine the distribution of positive and negative reviews, a binary variable was created: a review with a rating above three was labeled positive, and a review with a rating below three was labeled negative. Ratings of three were considered neutral and were removed, since a neutral rating is neither good nor bad and contributes no positive or negative signal.
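A minimal sketch of this labeling step, assuming a pandas DataFrame whose column names (review_body, stars) are chosen here for illustration:

import pandas as pd

# Toy frame standing in for one of the review datasets; column names assumed.
df = pd.DataFrame({
    "review_body": ["love it", "terrible quality", "it is okay", "great value"],
    "stars": [5, 1, 3, 4],
})

df = df[df["stars"] != 3].copy()              # drop neutral (rating = 3) reviews
df["label"] = (df["stars"] > 3).astype(int)   # 1 = positive (>3), 0 = negative (<3)
print(df[["review_body", "label"]])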
Figure 2 provides a visual representation of the distribution of positive and negative reviews within the three datasets. Although there are slight differences between the positive and negative counts, all three distributions are approximately balanced. To support this visual representation, Table 1 lists the number of positive and negative reviews in each of the three datasets. Because the datasets are small, it is challenging for a sentiment analysis model to capture their critical characteristics and achieve optimal, accurate prediction performance.
The raw datasets were much larger and required filtering to build balanced subsets for binary classification. The original datasets included 679,077 German, 262,397 Japanese, and 254,044 French reviews; these large, skewed datasets were filtered into balanced sets of positive and negative reviews suitable for binary sentiment classification. Translation-based data augmentation using Google Translate and DeepL was then applied to the filtered datasets. This sequence ensured that augmentation was performed only on balanced data, preserving the 1:1 ratio of positive to negative reviews in the final augmented sets. The purpose of augmentation in this work was not merely to enlarge the datasets but to increase data diversity while maintaining balance, thereby improving model robustness and generalization. All text was converted to lowercase, and punctuation was removed. The text was then tokenized using CountVectorizer from scikit-learn. For the German dataset, stopwords were removed using the English stopword list in CountVectorizer (stop_words='english'). No stemming or lemmatization was applied, in order to preserve the original linguistic features of each language. Finally, the cleaned text was transformed into feature vectors for model training.
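A minimal sketch of the preprocessing and vectorization just described; note that CountVectorizer lowercases and strips most punctuation on its own, so the explicit regex simply mirrors the stated steps. The sample reviews are illustrative.

import re
from sklearn.feature_extraction.text import CountVectorizer

reviews = ["Great product!!", "Nicht gut...", "Tres bien, merci."]

# Lowercase and remove punctuation, as described above.
cleaned = [re.sub(r"[^\w\s]", "", r.lower()) for r in reviews]

# Tokenize and build count features; English stopword removal as applied
# to the German dataset (stop_words='english').
vectorizer = CountVectorizer(stop_words="english")
features = vectorizer.fit_transform(cleaned)
print(vectorizer.get_feature_names_out())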
4. Results
In this section, the findings of the study regarding the performance of sentiment analysis with translation-based data augmentation in the three languages are presented.
In the NLP field, evaluating and comparing machine learning models is important for determining a model's effectiveness, and the commonly used evaluation metrics are accuracy, F score, precision, and recall [16,26]. Accuracy measures the model's ability to classify correctly [16]. For each language, the performance of the machine learning models with data augmentation was assessed using these metrics. A confusion matrix is a table that relates predicted and actual classes, as shown in Table 3, and is the basis for calculating accuracy, precision, recall, and F score.
Accuracy (Equation (1)) measures how often the classifier is correct overall. Precision (Equation (2)) measures how often the classifier is correct when it predicts 'yes'. Recall (Equation (3)) measures how often the classifier predicts 'yes' when the actual class is 'yes'. The F score (Equation (4)) measures the accuracy of a binary classifier by combining the precision and recall values.
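For reference, the standard confusion-matrix formulations consistent with these descriptions, with TP, TN, FP, and FN denoting true positives, true negatives, false positives, and false negatives, are:

\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F\ \text{score}  &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}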
4.3. Discussion of Results
The results of this study show that the effectiveness of translation-based data augmentation for sentiment analysis varies greatly depending on the language and the machine translation (MT) service used. Notably, Google Translate consistently outperformed DeepL in improving model accuracy for French and Japanese datasets, while neither tool significantly enhanced performance for the German dataset. These differences can be linked to several interconnected factors.
First, language structure plays an essential role. Japanese has a subject–object–verb (SOV) sentence structure and uses logographic scripts (Kanji) and syllabaries (Hiragana and Katakana), which make it linguistically different from English and German. Google Translate, benefiting from extensive training on various language pairs, may handle this linguistic complexity better than DeepL, which is optimized for European languages. Conversely, German, although structurally closer to English (subject–verb–object), includes compound nouns, case-inflected grammar, and rigid syntax, which can lead to more translation errors, especially when intermediate languages are used during augmentation. These translation inconsistencies can reduce the quality of augmented data, particularly for sentiment-heavy expressions. In this context, the use of intermediate translation was intentional, as it can boost linguistic diversity in low-resource and cross-lingual settings, thereby enriching the training data. However, this method naturally involves risks such as translation noise, semantic drift, and quality issues, especially when multiple intermediate languages are used. These factors may explain the variability seen in our results. Future research will investigate strategies such as selective human validation or spot-checking of translated samples to maintain sentiment accuracy, along with automated quality metrics to measure translation consistency.
Second, the variability in translation quality across different MT tools likely affected the results. Google Translate is known for its broader language coverage and extensive training data, especially for non-European languages like Japanese, while DeepL is often considered better for translating European languages but may have less support for Japanese or less common translation pathways. Additionally, choosing intermediate languages during augmentation can introduce cascading translation errors, which tend to impact languages with less lexical overlap or different grammatical structures, such as German and Japanese.
Third, domain-specific vocabulary and dataset size may also have played a role. The dataset, based on Amazon beauty product reviews, may include informal or colloquial language that DeepL, often trained on formal text corpora, handles less effectively. Google Translate’s exposure to a wider range of text genres could explain its better performance in capturing contextual sentiment cues.
Overall, these findings show that translation-based augmentation is not always effective and that selecting appropriate translation tools and understanding language features are important. Future work should include translation quality metrics (e.g., BLEU and semantic similarity) and examine language-pair-specific tuning to improve augmentation results.
Evaluation of Research Questions and Hypotheses: The findings address RQ1, which asked whether translation-based data augmentation improves sentiment classification in low-resource languages. The results show that Google Translate augmentation improved performance for French and Japanese, while DeepL had mixed outcomes and neither tool significantly improved German. Hypothesis 1 is therefore partly supported: augmentation can enhance accuracy, but the benefit depends on the language.
For RQ2, which examined whether models trained on augmented data can generalize across languages, cross-lingual evaluation showed some improvement for distant pairs (e.g., Japanese → French), but the gains were inconsistent and often diminished when the languages had different grammatical structures. This partially supports Hypothesis 2, indicating that generalization is possible but not uniform across language pairs.