MDPI - Publisher of Open Access Journals

47 pages, 7087 KB

Open Access Article

Do Stop Words Matter in Bug Report Analysis? Empirical Findings Using Deep Learning Models Across Duplicate, Severity, and Priority Classification

by Jinfeng Ji and Geunseok Yang

Appl. Sci. 2025, 15(16), 9178; https://doi.org/10.3390/app15169178 - 20 Aug 2025

Viewed by 123

Abstract

As software systems continue to increase in complexity and scale, the number of reported bugs also grows. Bug reports are essential artifacts in software maintenance, supporting critical tasks such as detecting duplicate reports, predicting bug severity, and assigning priority levels. Although stop word removal is a common text preprocessing step in natural language processing, its effectiveness in deep learning-based bug report analysis has not been thoroughly evaluated. This study investigates the impact of stop word removal on three core bug report classification tasks. The analysis uses a dataset containing over 1.9 million bug reports from eight large-scale open-source projects, including Eclipse, FreeBSD, GCC, Gentoo, Kernel, RedHat, Sourceware, and WebKit. Five deep learning models are applied: convolutional neural networks, long short-term memory networks, gated recurrent units, Transformers, and BERT. Each model is evaluated on its performance with and without stop word removal during preprocessing. The results show that the F1 score difference was less than 0.01 in over 85% of comparisons, so stop word removal has little to no effect on predictive performance in eight open-source projects. Average F1-scores remain consistent across all tasks and models, with 0.36 for duplicate detection, 0.33 for severity prediction, and 0.33 for priority prediction. Statistical significance tests confirm that the observed differences are not meaningful across datasets or model types. The findings suggest that stop word removal is not necessary in deep learning-based bug report analysis. Removing this step may simplify preprocessing pipelines without reducing accuracy, particularly in large-scale and real-world software engineering applications. Full article
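
The comparison described in this abstract (the same model evaluated with and without stop word removal, then checked against a 0.01 F1 difference) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the stop-word list, the whitespace tokenizer, and all function names are hypothetical, and only the 0.01 threshold comes from the abstract.

```python
# Hypothetical, deliberately tiny stop-word list (an assumption for illustration).
STOP_WORDS = {"the", "a", "an", "is", "in", "on", "of", "to", "and", "when"}


def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop tokens found in STOP_WORDS."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the given positive label; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def negligible_difference(f1_with, f1_without, threshold=0.01):
    """True when the two F1 scores differ by less than the threshold."""
    return abs(f1_with - f1_without) < threshold
```

In use, one would preprocess the same reports twice (once through `remove_stop_words`, once without), train and score the same model on each variant, and pass the two F1 values to `negligible_difference`.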

23 pages, 978 KB

Open Access Article

Emotional Analysis in a Morphologically Rich Language: Enhancing Machine Learning with Psychological Feature Lexicons

by Ron Keinan, Efraim Margalit and Dan Bouhnik

Electronics 2025, 14(15), 3067; https://doi.org/10.3390/electronics14153067 - 31 Jul 2025

Viewed by 389

Abstract

This paper explores emotional analysis in Hebrew texts, focusing on improving machine learning techniques for depression detection by integrating psychological feature lexicons. Hebrew’s complex morphology makes emotional analysis challenging, and this study seeks to address that by combining traditional machine learning methods with sentiment lexicons. The dataset consists of over 350,000 posts from 25,000 users on the health-focused social network “Camoni” from 2010 to 2021. Various machine learning models—SVM, Random Forest, Logistic Regression, and Multi-Layer Perceptron—were used, alongside ensemble techniques like Bagging, Boosting, and Stacking. TF-IDF was applied for feature selection, with word and character n-grams, and pre-processing steps like punctuation removal, stop word elimination, and lemmatization were performed to handle Hebrew’s linguistic complexity. The models were enriched with sentiment lexicons curated by professional psychologists. The study demonstrates that integrating sentiment lexicons significantly improves classification accuracy. Specific lexicons—such as those for negative and positive emojis, hostile words, anxiety words, and no-trust words—were particularly effective in enhancing model performance. Our best model classified depression with an accuracy of 84.1%. These findings offer insights into depression detection, suggesting that practitioners in mental health and social work can improve their machine learning models for detecting depression in online discourse by incorporating emotion-based lexicons. The societal impact of this work lies in its potential to improve the detection of depression in online Hebrew discourse, offering more accurate and efficient methods for mental health interventions in online communities. Full article

(This article belongs to the Special Issue Techniques and Applications of Multimodal Data Fusion)
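
The TF-IDF weighting over word and character n-grams mentioned in this abstract can be sketched in plain Python. The abstract does not state which TF-IDF variant was used, so the formula here (relative term frequency times log(N / document frequency), no smoothing) is an assumption, as are the function names.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """All overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """All overlapping word n-grams, with words joined by a single space."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def tf_idf(docs_features):
    """Map each document (a list of features) to a dict of feature -> TF-IDF weight.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs_features)
    df = Counter()
    for feats in docs_features:
        df.update(set(feats))  # count each feature once per document
    weights = []
    for feats in docs_features:
        counts = Counter(feats)
        total = len(feats)
        weights.append(
            {f: (c / total) * math.log(n_docs / df[f]) for f, c in counts.items()}
        )
    return weights
```

Character n-grams are a common choice for morphologically rich languages because they capture shared stems and affixes without a full morphological analyzer; whether that was the authors' motivation is not stated in the abstract.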

21 pages, 804 KB

Open Access Article

Spam Email Detection Using Long Short-Term Memory and Gated Recurrent Unit

by Samiullah Saleem, Zaheer Ul Islam, Syed Shabih Ul Hasan, Habib Akbar, Muhammad Faizan Khan and Syed Adil Ibrar

Appl. Sci. 2025, 15(13), 7407; https://doi.org/10.3390/app15137407 - 1 Jul 2025

Viewed by 747

Abstract

In today’s business environment, emails are essential across all sectors, including finance and academia. There are two main types of emails: ham (legitimate) and spam (unsolicited). Spam wastes consumers’ time and resources and poses risks to sensitive data, with volumes doubling daily. Current spam identification methods, such as blocklist approaches and content-based techniques, have limitations, highlighting the need for more effective solutions. These constraints call for more detailed and accurate approaches, such as machine learning (ML) and deep learning (DL), for realistic detection of new scams. Attention has since turned to the potential of ML and DL technologies for detecting email spam. In this work, we develop a hybrid deep learning model in which Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) are applied distinctly to identify spam email. Although other models (CNNs, LSTM, GRU, or ensemble machine learning classifiers) have been applied independently in previous studies, this research contributes to the existing literature by combining the advantage of LSTM in capturing long-term dependencies with the computational efficiency of GRU. This hybridization addresses key issues such as the vanishing gradient problem and the excessive resource consumption usually encountered when applying standalone deep learning models. Moreover, our proposed model is superior in detection accuracy (90%) and AUC (98.99%). Though Transformer-based models are significantly lighter and can be used in real-time applications, they require extensive computation resources.
The proposed work presents a substantive and scalable foundation for spam detection that differs technically and practically from familiar approaches owing to its preprocessing steps, including, in particular, stop-word removal and TF-IDF vectorization, and to model testing on a large, real-world dataset (Enron-Spam). Additionally, delays in the feature comparison technique within the model minimize false positives and false negatives. Full article

22 pages, 3576 KB

Open Access Article

A Deep Learning Approach to Unveil Types of Mental Illness by Analyzing Social Media Posts

by Rajashree Dash, Spandan Udgata, Rupesh K. Mohapatra, Vishanka Dash and Ashrita Das

Math. Comput. Appl. 2025, 30(3), 49; https://doi.org/10.3390/mca30030049 - 3 May 2025

Viewed by 1118

Abstract

Mental illness has emerged as a widespread global health concern, often unnoticed and unspoken. In this era of digitization, social media has provided a prominent space for people to express their feelings and find solutions faster. Thus, the sheer amount of information in this area, which reflects users’ behavioral attributes, can be combined with the power of machine learning (ML) to make the entire diagnosis process smoother. In this study, an efficient ML model using Long Short-Term Memory (LSTM) is developed to determine the kind of mental illness a user may have from an arbitrary text posted by the user on social media. This study is based on natural language processing, where the prerequisites involve data collection from different social media sites and then pre-processing the collected data as per the requirements through stemming, lemmatization, stop word removal, etc. After examining the linguistic patterns of different social media posts, a reduced feature space is generated using appropriate feature engineering, which is further fed as input to the LSTM model to identify a type of mental illness. The performance of the proposed model is also compared with three other ML models, using both the full feature space and the reduced one. The optimal resulting model is selected by training and testing all of the models on the publicly available Reddit Mental Health Dataset. Overall, utilizing deep learning (DL) for mental health analysis can offer a promising avenue toward improved interventions, outcomes, and a better understanding of mental health issues at both the individual and population levels, aiding in decision-making processes. Full article

(This article belongs to the Section Engineering)

26 pages, 610 KB

Open Access Article

A Black-Box Analysis of the Capacity of ChatGPT to Generate Datasets of Human-like Comments

by Alejandro Rosete, Guillermo Sosa-Gómez and Omar Rojas

Computers 2025, 14(5), 162; https://doi.org/10.3390/computers14050162 - 27 Apr 2025

Viewed by 1455

Abstract

This paper examines the ability of ChatGPT to generate synthetic comment datasets that mimic those produced by humans. To this end, a collection of datasets containing human comments, freely available in the Kaggle repository, was compared to comments generated via ChatGPT. The latter were based on prompts designed to provide the necessary context for approximating human results. It was hypothesized that the responses obtained from ChatGPT would demonstrate a high degree of similarity with the human-generated datasets with regard to vocabulary usage. Two categories of prompts were analyzed, depending on whether they specified the desired length of the generated comments. The evaluation of the results primarily focused on the vocabulary used in each comment dataset, employing several analytical measures. This analysis yielded noteworthy observations, which reflect the current capabilities of ChatGPT in this particular task domain. It was observed that ChatGPT typically employs a reduced number of words compared to human respondents and tends to provide repetitive answers. Furthermore, the responses of ChatGPT have been observed to vary considerably when the length is specified. It is noteworthy that ChatGPT employs a smaller vocabulary, which does not always align with human language. Furthermore, the proportion of non-stop words in ChatGPT’s output is higher than that found in human communication. Finally, the vocabulary of ChatGPT is more closely aligned with human language than the similarity between the two configurations of ChatGPT. This alignment is particularly evident in the use of stop words. While it does not fully achieve the intended purpose, the generated vocabulary serves as a reasonable approximation, enabling specific applications such as the creation of word clouds. Full article

(This article belongs to the Special Issue Harnessing Artificial Intelligence for Social and Semantic Understanding)

25 pages, 1964 KB

Open Access Article

Hate Speech Detection and Online Public Opinion Regulation Using Support Vector Machine Algorithm: Application and Impact on Social Media

by Siyuan Li and Zhi Li

Information 2025, 16(5), 344; https://doi.org/10.3390/info16050344 - 24 Apr 2025

Viewed by 919

Abstract

Detecting hate speech in social media is challenging due to its rarity, high-dimensional complexity, and implicit expression via sarcasm or spelling variations, rendering linear models ineffective. In this study, the SVM (Support Vector Machine) algorithm is used to map text features from low-dimensional to high-dimensional space using kernel function techniques to meet complex nonlinear classification challenges. By maximizing the margin between classes to locate the optimal hyperplane and combining kernel techniques to implicitly adjust the data distribution, the classification accuracy of hate speech detection is significantly improved. Data collection leverages social media APIs (Application Programming Interfaces) and customized crawlers with OAuth 2.0 authentication and keyword filtering, ensuring relevance. Regular expressions validate data integrity, followed by preprocessing steps such as denoising, stop-word removal, and spelling correction. Word embeddings are generated using Word2Vec’s Skip-gram model, combined with TF-IDF (Term Frequency–Inverse Document Frequency) weighting to capture contextual semantics. A multi-level feature extraction framework integrates sentiment analysis via lexicon-based methods and BERT for advanced sentiment recognition. Experimental evaluations on two datasets demonstrate the SVM model’s effectiveness, achieving accuracies of 90.42% and 92.84%, recall rates of 88.06% and 90.79%, and average inference times of 3.71 ms and 2.96 ms. These results highlight the model’s ability to detect implicit hate speech accurately and efficiently, supporting real-time monitoring. This research contributes to creating a safer online environment by advancing hate speech detection methodologies. Full article

(This article belongs to the Special Issue Information Technology in Society)
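
Two ingredients named in this abstract, TF-IDF-weighted word embeddings and a kernel function for the SVM, can be sketched as below. The abstract does not say how the weighting is combined with the embeddings or which kernel is used, so the weighted average and the RBF kernel here are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def tfidf_weighted_vector(tokens, embeddings, tfidf):
    """Average the embeddings of known tokens, weighted by their TF-IDF scores.

    Tokens missing from either dictionary are skipped; if nothing matches,
    a zero vector is returned.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok in embeddings and tok in tfidf:
            w = tfidf[tok]
            for i, value in enumerate(embeddings[tok]):
                total[i] += w * value
            weight_sum += w
    if weight_sum == 0:
        return total
    return [value / weight_sum for value in total]


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2), one common nonlinear SVM kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)
```

A kernel of this kind lets the classifier separate classes with a nonlinear boundary without explicitly computing the high-dimensional feature map, which is the implicit mapping the abstract refers to.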

17 pages, 766 KB

Open Access Article

Preventing Posterior Collapse with DVAE for Text Modeling

by Tianbao Song, Zongyi Huang, Xin Liu and Jingbo Sun

Entropy 2025, 27(4), 423; https://doi.org/10.3390/e27040423 - 14 Apr 2025

Viewed by 857

Abstract

This paper introduces a novel variational autoencoder model termed DVAE to prevent posterior collapse in text modeling. DVAE employs a dual-path architecture within its decoder: path A and path B. Path A feeds text instances directly into the decoder, whereas path B replaces a subset of word tokens in the text instances with a generic unknown token before they are input into the decoder. A stopping strategy is implemented, wherein both paths are concurrently active during the early phases of training. As the model progresses towards convergence, path B is removed. To further refine the performance, a KL weight dropout method is employed, which randomly sets certain dimensions of the KL weight to zero during the annealing process. DVAE compels the latent variables to encode more information about the input texts through path B, while path A and the stopping strategy allow it to fully utilize the expressiveness of the decoder and to avoid the local optimum that arises when path B is active. Furthermore, the KL weight dropout method augments the number of active units within the latent variables. Experimental results show the excellent performance of DVAE in density estimation, representation learning, and text generation. Full article
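
The two input-side mechanisms described in this abstract, replacing a subset of tokens with an unknown token (path B) and randomly zeroing dimensions of the KL weight, can be sketched as follows. This is an illustration only: the independent per-token and per-dimension sampling, the `<unk>` symbol, and the rate parameters are assumptions, not details taken from the paper.

```python
import random


def mask_tokens(tokens, rate, rng, unk="<unk>"):
    """Path B input: replace each token with `unk` independently with probability `rate`."""
    return [unk if rng.random() < rate else tok for tok in tokens]


def kl_weight_dropout(kl_weight, dim, drop_prob, rng):
    """Return a per-dimension KL weight vector with entries zeroed with probability `drop_prob`."""
    return [0.0 if rng.random() < drop_prob else kl_weight for _ in range(dim)]


# Example usage with a seeded generator for reproducibility.
rng = random.Random(0)
path_a_input = ["the", "cat", "sat", "on", "the", "mat"]   # path A sees the text unchanged
path_b_input = mask_tokens(path_a_input, 0.5, rng)         # path B sees a partly masked copy
kl_weights = kl_weight_dropout(0.3, 8, 0.25, rng)          # 0.3 stands in for the annealed weight
```

Masking decoder inputs weakens the decoder's ability to predict the next token from its own history, which pushes it to rely on the latent variable; this is the general intuition behind such schemes rather than a claim about the paper's exact formulation.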

23 pages, 2751 KB

Open Access Article

Speech Production Development in Mandarin-Speaking Children: A Case of Lingual Stop Consonants

by Fangfang Li

Behav. Sci. 2025, 15(4), 516; https://doi.org/10.3390/bs15040516 - 13 Apr 2025

Viewed by 573

Abstract

Lingual stops are among the earliest sounds acquired by young children, but the process of acquiring the temporal coordination of lingual gestures necessary for the production of stop consonants appears to be protracted. The current research aims to investigate the developmental process of lingual stop consonants in 100 Mandarin-speaking 2- to 5-year-olds using the acoustic parameter voice onset time (VOT). Children were engaged in a word-repetition task and recorded while producing words that begin with /t/, /d/, /k/, and /g/. Results indicate well-established contrasts between /t/ and /d/ as well as between /k/ and /g/ by age 2. However, compared with adults’ speech patterns, children’s speech productions are characterized by greater within-category dispersion and overlap, as well as smaller phoneme discriminability. Mandarin-speaking children also go through an “overshoot” stage by producing longer-than-adult VOT values, especially for voiceless aspirated stops /t/ and /k/. Lastly, unlike adults who exhibit gender-specific patterns in VOT, boys and girls do not show distinct patterns in their VOT by age 5. These results will be discussed in relation to children’s lingual motor control development and the organization of phonological and phonetic structures during the process of language acquisition. Full article

(This article belongs to the Special Issue Developing Cognitive and Executive Functions Across Lifespan)

17 pages, 1841 KB

Open Access Article

Monitoring of Sustainable Development Trends: Text Mining in Regional Media

by Galina Chernyshova, Evgeniy Taran, Anna Firsova and Alla Vavilina

Sustainability 2025, 17(7), 3122; https://doi.org/10.3390/su17073122 - 1 Apr 2025

Cited by 1 | Viewed by 828

Abstract

The monitoring of regional development sustainability is closely linked to the development of an indicator system that best meets stakeholders’ requirements, providing a solid foundation for strategic decision-making. In pursuit of progress in achieving the Sustainable Development Goals (SDG), efforts are continuously being undertaken to refine and enhance the indicator framework. Implementing interdisciplinary approaches for a comprehensive assessment of sustainable development in regions allows for a swift expansion and augmentation of data on regional transformations. An important aspect of the study of sustainability at the regional level is the additional possibility of using unstructured news content through text mining methods. The issue of applying natural language processing techniques for Russian-language sources is significant, as a large number of relevant tools are developed for English. Additionally, the analysis of news content has several features that complicate the classification of sentiments of messages with mostly neutral wording. The proposed methodology for processing specific news content in assessing the sustainability of regional development was implemented. An application for data scraping was developed, data were collected taking into account the selected regions and periods, stop word dictionaries were configured, frequency analysis was implemented, and the sentiment analysis of the obtained slices was carried out. For the formed set of news documents related to sustainable development by keywords according to SDGs 1–17, for the regions of the Volga Federal District, a corpus of documents was obtained representing data for 2021, 2022, and 2023 for 14 regions. The analysis of key topics for different areas and periods was carried out using the cosine similarity measure. The developed approach to news analysis allows for increasing the efficiency of monitoring on various topics. 
This methodology has been tested for systemic and operational assessment in the dynamics of the sustainable development of regions. Text analysis methods within the framework of decision support at the regional level provide the opportunity to identify emerging trends. Full article

(This article belongs to the Section Development Goals towards Sustainability)
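
The cosine similarity measure that this abstract uses to compare key topics across regions and periods can be sketched over simple term-frequency vectors. The abstract does not say how documents are vectorized, so the raw token counts and the function name are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the term-frequency vectors of two token lists.

    Returns 0.0 if either list is empty.
    """
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Applied to the token lists of two news slices (for example, one region in two different years), a value near 1 indicates largely shared vocabulary and a value near 0 indicates little overlap.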

25 pages, 4295 KB

Open Access Article

Sound Change and Consonant Devoicing in Word-Final Sibilants: A Study of Brazilian Portuguese Plural Forms

by Wellington Mendes

Languages 2025, 10(3), 48; https://doi.org/10.3390/languages10030048 - 7 Mar 2025

Viewed by 1103

Abstract

This study investigates consonant devoicing in Brazilian Portuguese (BP), in order to assess whether an ongoing sound change is taking place. We examine plural forms consisting of a stop consonant followed by a word-final sibilant, such as in redes [hedz] ~ [heds] ~ [hets] and sedes [sɛdz] ~ [sɛds] ~ [sɛts], focusing on the emergence of voiceless sibilants before word-initial vowels (e.g., redes amarelas, ‘yellow hammocks’). If sibilants remain voiceless despite a following vowel, this challenges the expected regressive voicing assimilation in BP and raises the question of the conditions under which this devoicing occurs. Data were collected through recordings of oral production from twenty Brazilian speakers, using reading and picture naming tasks. Sibilant voicing was quantified using harmonics-to-noise ratio (HNR). A linear mixed-effects model—including random intercepts and slopes for both speakers and words—reveals that sibilants are significantly more voiced before a vowel than before a pause, but this voicing is substantially reduced when the sibilant is preceded by voiceless consonants. These findings indicate an ongoing devoicing process at pre-vocalic word boundaries in BP, affecting clusters [pz, tz, kz] and [bz, dz, gz] alike. Spectrographic analyses indicate that not only the sibilants but also their preceding stop may exhibit devoicing. Moreover, minimal-pair considerations suggest that speakers potentially maintain sibilant voicing in certain lexical items to preserve intelligibility (e.g., gra[dz] ‘grades’ and se[dz] ‘headquarters’ vs. grá[ts] ‘free’ and se[ts] ‘sets’). 
Drawing on Exemplar Theory, we propose a competition between the influence of the phonological environment and word-final devoicing: sibilants are sometimes voiced due to a following vowel (e.g., botes argentinos [bɔtz ah.ʒẽ.’tʃi.nus] ‘Argentine boats’), but they often emerge as voiceless due to consonantal devoicing (e.g., [bɔts ah.ʒẽ.’tʃi.nus]), resulting in both expected and unexpected forms. We suggest that fine phonetic detail, whether associated with allophonic or emergent sound patterns, contributes to the construction of phonological representations. Full article

(This article belongs to the Special Issue Phonetics and Phonology of Ibero-Romance Languages)

27 pages, 4951 KB

Open Access Article

The Link Between Perception and Production in the Laryngeal Processes of Multilingual Speakers

by Zsuzsanna Bárkányi and Zoltán G. Kiss

Languages 2025, 10(2), 29; https://doi.org/10.3390/languages10020029 - 5 Feb 2025

Viewed by 1273

Abstract

The present paper investigates the link between perception and production in the laryngeal phonology of multilingual speakers, focusing on non-contrastive segments and the dynamic aspect of these processes. Fourteen L1 Hungarian, L2 English, and L3 Spanish advanced learners took part in the experiments. The production experiments examined the aspiration of voiceless stops in word-initial position, regressive voicing assimilation, and pre-sonorant voicing; the latter two processes were analyzed both word-internally and across word boundaries. The perception experiments aimed to find out whether learners notice the phonetic outputs of these processes and regard them as linguistically relevant. Our results showed that perception and production are not aligned. Accurate production is dependent on accurate perception, but accurate perception is not necessarily transferred into production. In laryngeal postlexical processes, the native language seems to play the primary role even for highly competent learners, but markedness might be relevant too. The novel findings of this study are that phonetic category formation seems to be easier than the acquisition of dynamic allophonic alternations and that metaphonological awareness is correlated with perception but not with production. Full article

(This article belongs to the Special Issue Advances in the Investigation of L3 Speech Perception)

15 pages, 1119 KB

Open Access Article

Fit Talks: Forecasting Fitness Awareness in Saudi Arabia Using Fine-Tuned Transformers

by Nora Alturayeif, Deemah Alqahtani, Sumayh S. Aljameel, Najla Almajed, Lama Alshehri, Nourah Aldhuwaihi, Madawi Alhadyan and Nouf Aldakheel

Big Data Cogn. Comput. 2025, 9(2), 20; https://doi.org/10.3390/bdcc9020020 - 23 Jan 2025

Viewed by 1346

Abstract

Understanding public sentiment on health and fitness is essential for addressing regional health challenges in Saudi Arabia. This research employs sentiment analysis to assess fitness awareness by analyzing content from the X platform (formerly Twitter), using a dataset called Saudi Aware, which includes 3593 posts related to fitness awareness. Preprocessing steps such as normalization, stop-word removal, and tokenization ensured high-quality data. The findings revealed that positive sentiments about fitness and health were more prevalent than negative ones, with posts across all sentiment categories being most common in the western region. However, the eastern region exhibited the highest percentage of positive sentiment, indicating a strong interest in fitness and health. For sentiment classification, we fine-tuned two transformer architectures—BERT and GPT—utilizing three BERT-based models (AraBERT, MARBERT, CAMeLBERT) and GPT-3.5. These findings provide valuable insights into Saudi Arabian attitudes toward fitness and health, offering actionable information for public health campaigns and initiatives. Full article

17 pages, 1865 KB

Open Access Article

Improving Sentiment Analysis Performance on Imbalanced Moroccan Dialect Datasets Using Resample and Feature Extraction Techniques

by Zineb Nassr, Faouzia Benabbou, Nawal Sael and Touria Hamim

Information 2025, 16(1), 39; https://doi.org/10.3390/info16010039 - 10 Jan 2025

Viewed by 1494

Abstract

Sentiment analysis is a crucial component of text mining and natural language processing (NLP), involving the evaluation and classification of text data based on its emotional tone, typically categorized as positive, negative, or neutral. While significant research has focused on structured languages like English, unstructured languages, such as the Moroccan Dialect (MD), face substantial resource limitations and linguistic challenges, making effective sentiment analysis difficult. This study addresses this gap by exploring the integration of data-balancing techniques with machine learning (ML) methods, specifically investigating the impact of resampling techniques and feature extraction methods, including Term Frequency–Inverse Document Frequency (TF-IDF), Bag of Words (BOW), and N-grams. Through rigorous experimentation, we evaluate the effectiveness of these approaches in enhancing sentiment analysis accuracy for the Moroccan dialect. Our findings demonstrate that strategic resampling, combined with the TF-IDF method, significantly improves classification accuracy and robustness. We also explore the interaction between resampling strategies and feature extraction methods, revealing varying levels of effectiveness across different combinations. Notably, the Support Vector Machine (SVM) classifier, when paired with TF-IDF representation, achieves superior performance, with an accuracy of 90.24% and a precision of 90.34%. These results highlight the importance of tailored resampling techniques, appropriate feature extraction methods, and machine learning optimization in advancing sentiment analysis for under-resourced and dialect-heavy languages like the Moroccan dialect, providing a practical framework for future research and development in NLP for unstructured languages. Full article

(This article belongs to the Special Issue Application of Machine Learning in Data Science and Computational Intelligence)
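
The abstract does not name the resampling techniques it evaluates. As one simple representative, random oversampling of minority classes can be sketched as below; the choice of technique and the function name are assumptions for illustration, not the authors' method.

```python
import random


def random_oversample(samples, labels, rng):
    """Duplicate randomly chosen minority-class samples until every class
    matches the size of the largest class."""
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_label.values())
    out_samples, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels
```

Resampling of this kind is normally applied to the training split only, so that duplicated samples do not leak into the evaluation data.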

24 pages, 2740 KB

Open Access Article

Stop-Lateral Clusters in French and Spanish: Articulatory Timing Differences and Synchronic Patterns

by Laura Colantoni, Alexei Kochetov and Jeffrey Steele

Languages 2024, 9(12), 381; https://doi.org/10.3390/languages9120381 - 20 Dec 2024

Viewed by 2241

Abstract

While both French and Spanish have complex onsets, the languages differ in the variety and distribution of clusters allowed as well as in the realization of voiced stops. The present study examines the effects of C1 voicing, place of articulation, and language on the production of word-initial /pl bl kl gl/ using a combination of electropalatographic (C1 and C2 linguopalatal contact and timing) and acoustic measures (duration and relative intensity) from 4 French and 7 Spanish speakers. Certain between-language similarities and differences in the effects of voicing and place on intergestural timing were observed. In particular, (1) both languages showed more overlap in clusters where C1 was velar rather than labial; (2) the effect of voicing (more overlap in clusters with a voiced C1) was restricted to French; and (3) lateral duration was unaffected by C1 place or voicing, while C1 duration was strongly affected by stress and voicing in Spanish alone given the approximantization of voiced stops. These results contribute to a better understanding of the general mechanisms and language-specific patterns of intergestural coordination in onset clusters and add to the growing body of articulatory work on these complex structures in Romance languages.

(This article belongs to the Special Issue Phonetic and Phonological Complexity in Romance Languages)

28 pages, 3221 KB

Open Access Article

Dissimilation in Hispano-Romance Diminutive Suffixation

by Claire Julia Lozano and Travis G. Bradley

Languages 2024, 9(12), 380; https://doi.org/10.3390/languages9120380 - 20 Dec 2024

Viewed by 1157

Abstract

A highly productive derivational process, diminutive suffixation in Spanish (e.g., gatito ~ gatiko/gatico ‘little/well-known/beloved/awful cat’ < gato ‘cat’) has received much attention in the morphology–phonology interface literature. The present study contributes a novel comparative analysis of a dissimilatory alternation between diminutive suffix allomorphs -ito/a and -ico/a (-iko/a) across three Hispano-Romance varieties. In Judeo-Spanish, the voiceless dorsal stop [k] of default -iko/a dissimilates to coronal [t] after any dorsal segment [k, ɡ, ɡʷ, x, w] in the base-final syllable. In Colombian Spanish, the voiceless coronal stop [t] of default -ito/a dissimilates to dorsal [k] after only an identical [t] in the base-final syllable. By contrast, Castilian Spanish -ito/a does not dissimilate, thereby providing a baseline for comparison. All three varieties allow for optional iteration of the suffix, which conveys greater smallness or endearment than the simple diminutive, e.g., Castilian Spanish gatitito ‘little/beloved kitty’, without dissimilation. Iterated diminutives in Colombian Spanish show two patterns of dissimilation, which have not been fully acknowledged in the previous literature. For example, either (i) [it] and [ik] alternate to avoid adjacent identical syllable onsets, e.g., gat[ikitíko], or (ii) [it] is iterated until alternating with word-final [ik], e.g., gat[ititíko]. In all three Hispano-Romance varieties, base-final unstressed vowels are deleted before a vowel-initial diminutive suffix, followed by unstressed -o/a, and stress (indicated by an acute accent) is shifted rightward onto the penultimate syllable of the diminutive word. Vowel deletion and stress shift apply recursively in iterated diminutives. We propose an Optimality Theory analysis of these alternations in terms of suffix allomorphy that is phonologically conditioned by consonantal place dissimilation. The analysis is formalized as an interaction among constraints that enforce prosodic unmarkedness, output–output correspondence, allomorph preference, and similarity avoidance. We consider theoretical alternatives and compare our analysis to other recent proposals.

(This article belongs to the Special Issue Phonetics and Phonology of Ibero-Romance Languages)

Search Results (85)
