Search Results (33)

Search Parameters:
Keywords = lexical selection task

22 pages, 2431 KB  
Article
Perceptual Plasticity in Bilinguals: Language Dominance Reshapes Acoustic Cue Weightings
by Annie Tremblay and Hyoju Kim
Brain Sci. 2025, 15(10), 1053; https://doi.org/10.3390/brainsci15101053 - 27 Sep 2025
Viewed by 693
Abstract
Background/Objectives: Speech perception is shaped by language experience, with listeners learning to selectively attend to acoustic cues that are informative in their language. This study investigates how language dominance, a proxy for long-term language experience, modulates cue weighting in highly proficient Spanish–English bilinguals’ perception of English lexical stress. Methods: We tested 39 bilinguals with varying dominance profiles and 40 monolingual English speakers in a stress identification task using auditory stimuli that independently manipulated vowel quality, pitch, and duration. Results: Bayesian logistic regression models revealed that, compared to monolinguals, bilinguals relied less on vowel quality and more on pitch and duration, mirroring cue distributions in Spanish versus English. Critically, cue weighting within the bilingual group varied systematically with language dominance: English-dominant bilinguals patterned more like monolingual English listeners, showing increased reliance on vowel quality and decreased reliance on pitch and duration, whereas Spanish-dominant bilinguals retained a cue weighting that was more Spanish-like. Conclusions: These results support experience-based models of speech perception and provide behavioral evidence that bilinguals’ perceptual attention to acoustic cues remains flexible and dynamically responsive to long-term input. These results are in line with a neurobiological account of speech perception in which attentional and representational mechanisms adapt to changes in the input. Full article
(This article belongs to the Special Issue Language Perception and Processing)
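As a concrete illustration of the kind of analysis described above, here is a minimal Bayesian logistic regression sketch in Python with PyMC, fit to simulated trial data. The predictor names (vowel_quality, pitch, duration, dominance), the priors, and the data are hypothetical; the paper's actual model, which would also include participant-level structure, is not reproduced.

```python
# Minimal sketch of a Bayesian logistic regression for cue weighting.
# All data are simulated; predictors and priors are hypothetical.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(0)
n = 500
# Hypothetical cue manipulations, coded -1/+1, and a continuous dominance score.
vowel_quality = rng.choice([-1.0, 1.0], size=n)
pitch = rng.choice([-1.0, 1.0], size=n)
duration = rng.choice([-1.0, 1.0], size=n)
dominance = rng.normal(0.0, 1.0, size=n)
# Simulated "first-syllable stress" responses.
logit_p = (1.2 * vowel_quality + 0.6 * pitch + 0.4 * duration
           + 0.3 * dominance * vowel_quality)
y = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

with pm.Model() as model:
    b = pm.Normal("b", mu=0.0, sigma=1.0, shape=5)  # intercept + 4 slopes
    eta = (b[0] + b[1] * vowel_quality + b[2] * pitch
           + b[3] * duration + b[4] * dominance * vowel_quality)
    pm.Bernoulli("y", p=pm.math.sigmoid(eta), observed=y)
    idata = pm.sample(1000, tune=1000, target_accept=0.9, random_seed=0)

# Posterior means of the slopes act as estimated cue weights.
print(az.summary(idata, var_names=["b"]))
```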

34 pages, 1172 KB  
Article
Leveraging LLMs for Automated Extraction and Structuring of Educational Concepts and Relationships
by Tianyuan Yang, Baofeng Ren, Chenghao Gu, Tianjia He, Boxuan Ma and Shin’ichi Konomi
Mach. Learn. Knowl. Extr. 2025, 7(3), 103; https://doi.org/10.3390/make7030103 - 19 Sep 2025
Cited by 1 | Viewed by 1719
Abstract
Students must navigate large catalogs of courses and make appropriate enrollment decisions in many online learning environments. In this context, identifying key concepts and their relationships is essential for understanding course content and informing course recommendations. However, identifying and extracting concepts can be an extremely labor-intensive and time-consuming task when it has to be done manually. Traditional NLP-based methods to extract relevant concepts from courses heavily rely on resource-intensive preparation of detailed course materials, thereby failing to minimize labor. As recent advances in large language models (LLMs) offer a promising alternative for automating concept identification and relationship inference, we thoroughly investigate the potential of LLMs in automatically generating course concepts and their relations. Specifically, we systematically evaluate three LLM variants (GPT-3.5, GPT-4o-mini, and GPT-4o) across three distinct educational tasks, which are concept generation, concept extraction, and relation identification, using six systematically designed prompt configurations that range from minimal context (course title only) to rich context (course description, seed concepts, and subtitles). We systematically assess model performance through extensive automated experiments using standard metrics (Precision, Recall, F1, and Accuracy) and human evaluation by four domain experts, providing a comprehensive analysis of how prompt design and model choice influence the quality and reliability of the generated concepts and their interrelations. Our results show that GPT-3.5 achieves the highest scores on quantitative metrics, whereas GPT-4o and GPT-4o-mini often generate concepts that are more educationally meaningful despite lexical divergence from the ground truth. Nevertheless, LLM outputs still require expert revision, and performance is sensitive to prompt complexity. Overall, our experiments demonstrate the viability of LLMs as a tool for supporting educational content selection and delivery. Full article
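To make the notion of prompt configurations concrete, the sketch below contrasts a minimal-context prompt (course title only) with a rich-context prompt (title, description, seed concepts), assuming the OpenAI chat-completions client. The prompt wording, course data, and model choice are illustrative, not the paper's actual materials.

```python
# Sketch of "minimal" vs. "rich" prompt configurations for concept generation.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

course = {
    "title": "Introduction to Machine Learning",
    "description": "Covers supervised learning, model evaluation, and basic neural networks.",
    "seed_concepts": ["linear regression", "overfitting"],
}

minimal_prompt = f"List the key concepts taught in the course titled '{course['title']}'."

rich_prompt = (
    f"Course title: {course['title']}\n"
    f"Description: {course['description']}\n"
    f"Known concepts: {', '.join(course['seed_concepts'])}\n"
    "List additional key concepts and, for each pair of related concepts, "
    "state the relation (e.g., prerequisite-of, part-of)."
)

for name, prompt in [("minimal", minimal_prompt), ("rich", rich_prompt)]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(f"--- {name} context ---")
    print(response.choices[0].message.content)
```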

34 pages, 3234 KB  
Article
L1 Attrition vis-à-vis L2 Acquisition: Lexicon, Syntax–Pragmatics Interface, and Prosody in L1-English L2-Italian Late Bilinguals
by Mattia Zingaretti, Vasiliki Chondrogianni, D. Robert Ladd and Antonella Sorace
Languages 2025, 10(9), 224; https://doi.org/10.3390/languages10090224 - 4 Sep 2025
Cited by 1 | Viewed by 2170
Abstract
Late bilingual speakers immersed in a second language (L2) environment often experience the non-pathological attrition of their first language (L1), exhibiting selective and reversible changes in L1 processing and production. While attrition research has largely focused on long-term residents in anglophone countries, examining changes primarily within a single L1 domain, the present study employs a novel experimental design to investigate L1 attrition, alongside L2 acquisition, across three domains (i.e., the lexicon, syntax–pragmatics interface, and prosody) in two groups of L1-English L2-Italian late bilinguals: long-term residents in Italy vs. university students in the UK. A total of 112 participants completed online tasks assessing lexical retrieval, anaphora resolution, and sentence stress patterns in both languages. First, both bilingual groups showed comparable levels of semantic interference in lexical retrieval. Second, at the syntax–pragmatics interface, only residents in Italy showed signs of L1 attrition in real-time processing of anaphora, while resolution preferences were similar between groups; in the L2, both bilingual groups demonstrated target-like preferences, despite some slowdown in processing. Third, while both groups showed some evidence of target-like L2 prosody, with residents in Italy matching L1-Italian sentence stress patterns closely, prosodic attrition was only reported for residents in Italy in exploratory analyses. Overall, this study supports the notion of L1 attrition as a natural consequence of bilingualism—one that is domain- and experience-dependent, unfolds along a continuum, and involves a complex (and possibly inverse) relationship between L1 and L2 performance that warrants further investigation. Full article

19 pages, 1759 KB  
Article
Lost and Found? Shifts in Heritage Speakers’ Processing of Mood Morphology over the Course of a Semester Abroad
by David Giancaspro and Sara Fernández Cuenca
Languages 2025, 10(7), 163; https://doi.org/10.3390/languages10070163 - 29 Jun 2025
Viewed by 883
Abstract
Of the few studies that have investigated the linguistic development of heritage speakers (HSs) in the study abroad (SA) context, none have utilized on-line experiments, in spite of these tasks’ clear methodological benefits. In this study, therefore, we test HSs’ on-line sensitivity to lexically selected mood morphology in Spanish. Ten adult HSs completed a self-paced reading task at the beginning and end of a fifteen-week-long SA program in Spain. The task assessed both (a) whether HSs were sensitive to mood incongruencies (e.g., by slowing down after ungrammatical verbs) and (b) whether that (in)sensitivity was different with regular vs. irregular verbs. It was hypothesized that participants would be more sensitive to mood with irregular verbs and that their mood sensitivity would increase over the course of the semester abroad, but these hypotheses were only partially supported. Although HSs developed sensitivity to mood incongruencies with regular verbs over the course of the semester abroad, they showed the reverse pattern with irregular verbs, demonstrating sensitivity at Session 1 but not Session 2. Nonetheless, because participants’ reading times decreased sharply over the semester—and without any concomitant decrease in comprehension accuracy—we argue that SA immersion likely does facilitate morphosyntactic processing in the HL. Full article
(This article belongs to the Special Issue Language Processing in Spanish Heritage Speakers)
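A minimal sketch of the kind of region-of-interest comparison a self-paced reading study relies on: mean reading times at the critical verb region broken down by grammaticality, verb regularity, and session. The data frame and column names below are hypothetical.

```python
# Toy region-of-interest analysis for self-paced reading data.
import pandas as pd

trials = pd.DataFrame({
    "session":     [1, 1, 1, 1, 2, 2, 2, 2],
    "regularity":  ["regular", "regular", "irregular", "irregular"] * 2,
    "grammatical": [True, False, True, False] * 2,
    "rt_ms":       [420, 455, 410, 470, 380, 430, 395, 400],
})

means = (trials
         .groupby(["session", "regularity", "grammatical"])["rt_ms"]
         .mean()
         .unstack("grammatical"))

# A positive difference (ungrammatical slower than grammatical) indicates
# sensitivity to the mood incongruency at that session.
means["sensitivity_ms"] = means[False] - means[True]
print(means)
```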

19 pages, 2419 KB  
Article
Combining Lexicon Definitions and the Retrieval-Augmented Generation of a Large Language Model for the Automatic Annotation of Ancient Chinese Poetry
by Jiabin Li, Tingxin Wei, Weiguang Qu, Bin Li, Minxuan Feng and Dongbo Wang
Mathematics 2025, 13(12), 2023; https://doi.org/10.3390/math13122023 - 19 Jun 2025
Cited by 1 | Viewed by 1440
Abstract
Existing approaches to the automatic annotation of classical Chinese poetry often fail to generate precise source citations and depend heavily on manual segmentation, limiting their scalability and accuracy. To address these shortcomings, we propose a novel paradigm that integrates dictionary retrieval with retrieval-augmented large language model enhancements for automatic poetic annotation. Our method leverages the contextual understanding capabilities of large models to dynamically select appropriate lexical senses and employs an automated segmentation technique to minimize reliance on manual splitting. For poetic segments absent from standard dictionaries, the system retrieves pertinent information from a domain-specific knowledge base and generates definitions grounded in this auxiliary data, thereby substantially improving both annotation accuracy and coverage. The experimental results demonstrate that our approach outperforms general-purpose large language models and pre-trained classical Chinese language models on automatic annotation tasks; notably, it achieves a micro-averaged accuracy of 94.33% on key semantic segments. By delivering more precise and comprehensive annotations, this framework advances the computational analysis of classical Chinese poetry and offers significant potential for intelligent teaching applications and digital humanities research. Full article
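The sketch below illustrates the dictionary-first, retrieval-fallback flow described above: an exact lexicon hit is used directly, and otherwise the closest knowledge-base entries are retrieved and folded into a prompt for an LLM to draft a gloss. The lexicon entries, knowledge base, and prompt wording are hypothetical, and the LLM call itself is omitted.

```python
# Sketch of dictionary lookup with retrieval-augmented fallback for annotation.
from difflib import get_close_matches

lexicon = {
    "明月": "bright moon; often evokes homesickness in Tang poetry",
    "故乡": "one's native place, hometown",
}

knowledge_base = {
    "举头望明月": "Line from Li Bai's 'Quiet Night Thought': raising the head to gaze at the bright moon.",
    "低头思故乡": "Following line: lowering the head and thinking of home.",
}

def annotate(segment: str) -> str:
    # 1. Exact dictionary hit: use the lexicon sense directly.
    if segment in lexicon:
        return f"{segment}: {lexicon[segment]} (from lexicon)"
    # 2. Otherwise retrieve the closest knowledge-base entries as context.
    context_keys = get_close_matches(segment, knowledge_base.keys(), n=2, cutoff=0.1)
    context = "\n".join(knowledge_base[k] for k in context_keys)
    prompt = (
        f"Segment to gloss: {segment}\n"
        f"Related material:\n{context}\n"
        "Write a concise annotation grounded in the material above."
    )
    return prompt  # in the full pipeline this prompt would be sent to the LLM

print(annotate("明月"))
print(annotate("举头望明月"))
```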

18 pages, 512 KB  
Article
Animate, or Inanimate, That Is the Question for Large Language Models
by Giulia Pucci, Fabio Massimo Zanzotto and Leonardo Ranaldi
Information 2025, 16(6), 493; https://doi.org/10.3390/info16060493 - 13 Jun 2025
Viewed by 1216
Abstract
Human cognition is closely tied to the concept of animacy, which significantly influences memory, vision, and complex language comprehension. While animacy is reflected in language through subtle constraints on verbs and adjectives, it is also acquired and honed through non-linguistic experience. In the same vein, we suggest that the limited capacity of LLMs to grasp natural language, particularly in relation to animacy, stems from the fact that these models are trained solely on textual data. This raises the question the paper aims to answer: can LLMs process animacy in a way similar to humans? We propose a systematic analysis via prompting approaches. In particular, we probe different LLMs using controlled lexical contrasts (animate vs. inanimate nouns) and narrative contexts in which typically inanimate entities behave as animate. We evaluate seven LLMs from three major families: OpenAI (GPT-3.5, GPT-4), Meta (Llama2 7B, 13B, 70B), and Mistral (Mistral-7B, Mixtral). Results reveal that, although these models are trained predominantly on textual data, they exhibit human-like behavior when faced with typical animate and inanimate entities, in line with earlier studies. GPT models generally achieve the most consistent and human-like performance and, in some tasks such as sentence plausibility and acceptability judgments, even surpass human baselines; the other models achieve comparable, if somewhat weaker, results. Hence, LLMs can adapt to unconventional situations, recognising atypical entities as animate without access to the non-linguistic cues that humans rely on when construing animacy. Full article

28 pages, 1825 KB  
Article
Letter and Word Processing in Developmental Dyslexia: Evidence from a Two-Alternative Forced Choice Task
by Daniela Traficante, Pierluigi Zoccolotti and Chiara Valeria Marinelli
Children 2025, 12(5), 572; https://doi.org/10.3390/children12050572 - 29 Apr 2025
Viewed by 837
Abstract
Background/Objectives: The present study aimed to investigate letter processing in children with dyslexia and typically developing readers as a function of the type of orthographic context. Methods and Results: In Experiment 1A, children performed a two-alternative forced choice task (Reicher–Wheeler paradigm) using as probes either high-frequency words, pronounceable pseudo-words, or unpronounceable non-words. The group differences in letter recognition were clearly distinguished from those present in typical word and pseudo-word reading conditions (Experiment 1B), as a global factor was present only in the latter case. In Experiment 2, the two-alternative forced choice task required the child to search for the target letter in the subsequent multi-letter string (i.e., words, pseudo-words, or non-words), thus reducing the memory load. Detecting the target letter was more difficult in a word than in a pseudo-word or non-word array, indicating that the word form’s lexical activation interfered with the target’s analysis in both groups of children. In Experiment 3, children performed the two-alternative forced choice task with symbols (Greek letters) either in the Reicher–Wheeler mode of presentation (Experiment 3A) or in the search condition (Experiment 3B). Children with dyslexia performed identically to typically developing readers, in keeping with the selectivity of their orthographic difficulties. Conclusions: The present data indicate that children with dyslexia suffer from an early deficit in performing perceptual operations that require the conjunction analysis of a set of letters. Still, this deficit is not due to an inability to scan the letter string. The deficit is confined to orthographic stimuli and does not extend to other types of visual targets. Full article

22 pages, 3861 KB  
Article
Exploring the Relationship Between Preference and Production as Indicators of L2 Sociophonetic Competence
by Megan Solon and Matthew Kanwit
Languages 2025, 10(4), 65; https://doi.org/10.3390/languages10040065 - 28 Mar 2025
Cited by 1 | Viewed by 990
Abstract
Sociophonetic competence—a component of sociolinguistic and, thus, communicative competence—has been explored in both learner production and perception. Still, little is known about the relationship between learners’ ability to account for sociophonetic variability in the input and their likelihood to produce such variation in output. The present study explores 21 learners’ preference for a specific sociophonetic variant on an aural preference task and the same learners’ patterns of production of the variant in semi-spontaneous speech. The sociolinguistic variable considered is Spanish intervocalic /d/, variably realized as approximant [ð] or deleted based on numerous (extra)linguistic factors, including the speaker’s gender, the vowel that precedes /d/, and the grammatical category and lexical frequency of the word containing /d/. Results reveal that preference for and production of a deleted variant increased with learner proficiency. Moreover, regardless of proficiency, learners generally selected deleted /d/ more than they produced it, suggesting that sociophonetic awareness precedes reliable production. Learners’ production of a deleted variant was influenced by the preceding vowel, the grammatical category of the word containing /d/, and the word’s lexical frequency, and sensitivity to these predictors was especially observed as proficiency increased. Learners produced the deleted variant more after /o/, in adjectives and nouns, and in frequent words. Full article
(This article belongs to the Special Issue The Acquisition of L2 Sociolinguistic Competence)
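As a rough illustration of how such (extra)linguistic predictors of /d/ deletion can be modeled, here is a plain logistic regression on simulated tokens with statsmodels. The variable names and simulated effects are hypothetical, and the study's actual analysis (including any random effects for speaker or word) is not reproduced.

```python
# Toy logistic regression of /d/ deletion on linguistic predictors.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 300
prev_vowel = rng.choice(["a", "i", "o"], size=n)
gram_cat = rng.choice(["noun", "verb", "adj"], size=n)
log_word_freq = rng.normal(2.0, 1.0, size=n)

# Simulated deletion probabilities: more deletion after /o/, outside verbs,
# and in more frequent words (purely illustrative effect sizes).
logit_p = (-1.0 + 0.8 * (prev_vowel == "o") + 0.5 * (gram_cat != "verb")
           + 0.6 * (log_word_freq - 2.0))
deleted = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

df = pd.DataFrame({"deleted": deleted, "prev_vowel": prev_vowel,
                   "gram_cat": gram_cat, "log_word_freq": log_word_freq})

model = smf.logit("deleted ~ C(prev_vowel) + C(gram_cat) + log_word_freq",
                  data=df).fit()
print(model.summary())
```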

19 pages, 2315 KB  
Article
The Flexible Role of Social Experience in the Processing of Abstract Concepts
by Zhao Yao, Yu Chai and Xiaoli He
Behav. Sci. 2025, 15(2), 190; https://doi.org/10.3390/bs15020190 - 11 Feb 2025
Cited by 1 | Viewed by 1086
Abstract
Multiple representation theories propose that social experience plays an important role in grounding abstract concepts. However, it is less clear how social experience influences the processing of abstract concepts, especially whether this influence is modulated by emotional experience and task demands. To address this question, we orthogonally manipulated the socialness (high vs. low) and emotional valence (positive vs. negative vs. neutral) of abstract words in a lexical decision task (LDT, Experiment 1) and an emotional Stroop task (Experiment 2). Results show that the role of socialness in abstract concept processing was modulated by the concept’s emotional valence, with different patterns between the two tasks. Specifically, positive high-socialness (HS) words elicited slower responses than positive low-socialness (LS) words in the emotional Stroop task, but no such difference was observed in the LDT. In both tasks, however, faster responses were found for negative HS than for negative LS words, and no response differences were observed for neutral HS and LS words. These results provide behavioral evidence for the importance of social experience in the processing of abstract concepts and suggest that concept knowledge derived from social experiences interacts with concept knowledge derived from emotional experiences during lexical–semantic processing. This finding clarifies the heterogeneity of abstract concepts, with positive and negative social words constituting distinct subcategories, and confirms experience-based abstract concepts are inherently flexible, selectively combining with other associated embodied experiences based on task demands, thereby dynamically influencing abstract concept processing. Full article

28 pages, 1653 KB  
Article
Automatic Text Simplification for Lithuanian: Transforming Administrative Texts into Plain Language
by Justina Mandravickaitė, Eglė Rimkienė, Danguolė Kotryna Kapkan, Danguolė Kalinauskaitė, Antanas Čenys and Tomas Krilavičius
Mathematics 2025, 13(3), 465; https://doi.org/10.3390/math13030465 - 30 Jan 2025
Viewed by 2192
Abstract
In this study, we present the results of experiments on text simplification for the Lithuanian language, where we aim to simplify administrative-style texts to the Plain Language level. We selected mT5, mBART, and LT-Llama-2 as the foundational models and fine-tuned them for the text simplification task. Additionally, we evaluated ChatGPT for this purpose. Also, we conducted a comprehensive assessment of the simplification results provided by these models both quantitatively and qualitatively. The results demonstrated that mBART was the most effective model for simplifying Lithuanian administrative text, achieving the highest scores across all the evaluation metrics. A qualitative evaluation of the simplified sentences complemented our quantitative findings. Attention analysis provided insights into model decisions, highlighting strengths in lexical and syntactic simplifications but revealing challenges with longer, complex sentences. Our findings contribute to advancing text simplification for lesser-resourced languages, with practical applications for more effective communication between institutions and the general public, which is the goal of Plain Language. Full article
(This article belongs to the Section E1: Mathematics and Computer Science)
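A minimal sketch of the generation step with a fine-tuned mBART model via Hugging Face transformers, assuming mBART-50 tokenizer conventions (Lithuanian is covered by the lt_LT language code). The checkpoint path and the input sentence are placeholders, not the authors' actual model or data.

```python
# Sketch of simplification as conditional generation with a fine-tuned mBART.
from transformers import MBart50TokenizerFast, MBartForConditionalGeneration

model_name = "path/to/fine-tuned-mbart-lt-simplification"  # hypothetical checkpoint
tokenizer = MBart50TokenizerFast.from_pretrained(model_name,
                                                 src_lang="lt_LT", tgt_lang="lt_LT")
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Placeholder administrative-register sentence.
text = "Šio įstatymo nuostatos taikomos visiems asmenims."

inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["lt_LT"],  # decode into Lithuanian
    max_length=128,
    num_beams=4,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```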

13 pages, 487 KB  
Systematic Review
The Functional Origin of Oral Word Production Deficits in the Logopenic Variant of Primary Progressive Aphasia: A Systematic Review
by Amra Hasanovic, Joël Macoir, Amélie Sanfaçon-Verret and Laura Monetta
Brain Sci. 2025, 15(2), 111; https://doi.org/10.3390/brainsci15020111 - 24 Jan 2025
Viewed by 1391
Abstract
Background/Objectives: Oral word production (OWP) deficits are prominent in the logopenic variant of primary progressive aphasia (lvPPA); however, their functional origin remains unclear. Some studies suggest a lexical, post-lexical, or even a combined functional origin of these deficits. The aim of the present study was to synthesize and analyze the information on the functional origin of the OWP deficits in patients with lvPPA. Methods: A quantitative systematic literature review was carried out using four databases: CINAHL, PsycINFO, Linguistics and Language Behavior Abstracts, and PubMed. Fourteen studies, including a total of 243 patients with lvPPA, and reporting results on picture naming and/or word and/or pseudoword repetition, were selected. Results: The overall findings of this review highlighted that two main functional origins appear to explain the OWP deficits in lvPPA: a lexical impairment affecting lexical processing and a post-lexical impairment affecting phonological short-term memory. Interestingly, the possibility of a third functional origin, affecting the semantic processing level, was also suggested by some studies. Conclusions: We concluded that the presence of different functional origins of OWP in this population may be explained, at least partially, by the diversity of assessment tasks used in studies and the varied manipulation and control of psycholinguistic properties of words (e.g., frequency, length), as well as the various interpretations and analyses of the participants’ errors. Further studies are needed to substantiate these findings by examining all the components involved in OWP, carefully manipulating the psycholinguistic properties and qualitatively analyzing the errors made by lvPPA participants. Full article

23 pages, 5944 KB  
Article
Examining Sentiment Analysis for Low-Resource Languages with Data Augmentation Techniques
by Gaurish Thakkar, Nives Mikelić Preradović and Marko Tadić
Eng 2024, 5(4), 2920-2942; https://doi.org/10.3390/eng5040152 - 7 Nov 2024
Cited by 1 | Viewed by 3300
Abstract
This study investigates the influence of a variety of data augmentation techniques on sentiment analysis in low-resource languages, with a particular emphasis on Bulgarian, Croatian, Slovak, and Slovene. The following primary research question is addressed: is it possible to improve sentiment analysis efficacy in low-resource languages through data augmentation? Our sub-questions look at how different augmentation methods affect performance, how effective WordNet-based augmentation is compared to other methods, and whether lemma-based augmentation techniques can be used, especially for Croatian sentiment tasks. Our data sources comprise sentiment-labelled evaluation datasets in the selected languages, curated with additional annotations to standardise labels and mitigate ambiguities. Our findings show that techniques like replacing words with synonyms, masked language model (MLM)-based generation, and permuting and combining sentences can only make training datasets slightly bigger. However, they provide limited improvements in model accuracy for low-resource language sentiment classification. WordNet-based techniques, in particular, exhibit a marginally superior performance compared to other methods; however, they fail to substantially improve classification scores. From a practical perspective, this study emphasises that conventional augmentation techniques may require refinement to address the complex linguistic features that are inherent to low-resource languages, particularly in mixed-sentiment and context-rich instances. Theoretically, our results indicate that future research should concentrate on the development of augmentation strategies that introduce novel syntactic structures rather than solely relying on lexical variations, as current models may not effectively leverage synonymic or lemmatised data. These insights emphasise the nuanced requirements for meaningful data augmentation in low-resource linguistic settings and contribute to the advancement of sentiment analysis approaches. Full article
(This article belongs to the Special Issue Feature Papers in Eng 2024)
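The simplest of the augmentation techniques discussed above is synonym replacement against a WordNet. The sketch below shows the mechanism with NLTK's English WordNet; the study itself targets Bulgarian, Croatian, Slovak, and Slovene resources, so this only illustrates the idea.

```python
# Toy WordNet-based synonym-replacement augmentation (English illustration).
import random
from nltk.corpus import wordnet as wn
# Requires a one-off: import nltk; nltk.download("wordnet")

def synonym_replace(sentence: str, n_replacements: int = 2, seed: int = 0) -> str:
    rng = random.Random(seed)
    tokens = sentence.split()
    candidates = list(range(len(tokens)))
    rng.shuffle(candidates)
    replaced = 0
    for i in candidates:
        # Collect all WordNet lemmas for the token, excluding the token itself.
        synsets = wn.synsets(tokens[i])
        lemmas = {l.name().replace("_", " ") for s in synsets for l in s.lemmas()}
        lemmas.discard(tokens[i])
        if lemmas:
            tokens[i] = rng.choice(sorted(lemmas))
            replaced += 1
        if replaced >= n_replacements:
            break
    return " ".join(tokens)

print(synonym_replace("the movie was surprisingly good and the acting felt honest"))
```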

15 pages, 1156 KB  
Article
The Contribution of Cognitive Control Networks in Word Selection Processing in Parkinson’s Disease: Novel Insights from a Functional Connectivity Study
by Sonia Di Tella, Matteo De Marco, Isabella Anzuino, Davide Quaranta, Francesca Baglio and Maria Caterina Silveri
Brain Sci. 2024, 14(9), 913; https://doi.org/10.3390/brainsci14090913 - 11 Sep 2024
Viewed by 1674
Abstract
Parkinson’s disease (PD) patients are impaired in word production when the word has to be selected among competing alternatives requiring higher attentional resources. In PD, word selection processes are correlated with the structural integrity of the inferior frontal gyrus, which is critical for response selection, and the uncinate fasciculus, which is necessary for processing lexical information. In early PD, we investigated the role of the main cognitive large-scale networks, namely the salience network (SN), the central executive networks (CENs), and the default mode network (DMN), in word selection. Eighteen PD patients and sixteen healthy controls were required to derive nouns from verbs or generate verbs from nouns. Participants also underwent a resting-state functional MRI. Functional connectivity (FC) was examined using independent component analysis. Functional seeds for the SN, CENs, and DMN were defined as spheres, centered at the local activation maximum. Correlations were calculated between the FC of each functional seed and word production. A significant association between SN connectivity and task performance and, with less evidence, between CEN connectivity and the task requiring selection among a larger number of competitors, emerged in the PD group. These findings suggest the involvement of the SN and CEN in word selection in early PD, supporting the hypothesis of impaired executive control. Full article
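A toy sketch of seed-based functional connectivity: correlate a seed region's time series with a target region's series per participant, then relate that FC value to behavioral performance across participants. All values below are simulated, and the study's ICA-based pipeline is not reproduced.

```python
# Simulated seed-based functional connectivity and its link to behaviour.
import numpy as np

rng = np.random.default_rng(42)
n_subjects, n_timepoints = 18, 200

fc_values, task_scores = [], []
for _ in range(n_subjects):
    shared = rng.normal(size=n_timepoints)                 # common fluctuation
    seed_ts = shared + rng.normal(scale=1.0, size=n_timepoints)
    target_ts = shared + rng.normal(scale=1.0, size=n_timepoints)
    fc = np.corrcoef(seed_ts, target_ts)[0, 1]             # seed-target FC
    fc_values.append(fc)
    task_scores.append(20 + 10 * fc + rng.normal(scale=2.0))  # simulated behaviour

# Across-subject association between connectivity and word-production score.
r = np.corrcoef(fc_values, task_scores)[0, 1]
print(f"FC-performance correlation across subjects: r = {r:.2f}")
```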

17 pages, 1397 KB  
Article
A Comprehensive Analysis of Various Tokenizers for Arabic Large Language Models
by Faisal Qarah and Tawfeeq Alsanoosy
Appl. Sci. 2024, 14(13), 5696; https://doi.org/10.3390/app14135696 - 29 Jun 2024
Cited by 12 | Viewed by 7635
Abstract
Pretrained language models have achieved great success in various natural language understanding (NLU) tasks due to their capacity to capture deep contextualized information in text using pretraining on large-scale corpora. Tokenization plays a significant role in the process of lexical analysis. Tokens become the input for other natural language processing (NLP) tasks, like semantic parsing and language modeling. However, there is a lack of research on the evaluation of the impact of tokenization on the Arabic language model. Therefore, this study aims to address this gap in the literature by evaluating the performance of various tokenizers on Arabic large language models (LLMs). In this paper, we analyze the differences between WordPiece, SentencePiece, and BBPE tokenizers by pretraining three BERT models using each tokenizer while measuring the performance of each model on seven different NLP tasks using 29 different datasets. Overall, the model pretrained with text tokenized using the SentencePiece tokenizer significantly outperforms the other two models that utilize WordPiece and BBPE tokenizers. The results of this paper will assist researchers in developing better models, making better decisions in selecting the best tokenizers, improving feature engineering, and making models more efficient, thus ultimately leading to advancements in various NLP applications. Full article
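The contrast between tokenizers can be made concrete with the Hugging Face tokenizers library: the sketch below trains small WordPiece, byte-level BPE, and SentencePiece-style Unigram tokenizers on a toy Arabic corpus and compares their segmentations of one sentence. The corpus and vocabulary sizes are placeholders; the paper pretrains full BERT models with each tokenizer, which is far beyond this illustration.

```python
# Toy comparison of WordPiece, byte-level BPE, and Unigram segmentation.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = [
    "اللغة العربية غنية بالصرف والاشتقاق",
    "تلعب المعالجة الآلية للغة دورا مهما في التطبيقات الحديثة",
    "يؤثر اختيار المقطع على جودة النموذج اللغوي",
]

def build(model, trainer, pre_tok):
    tok = Tokenizer(model)
    tok.pre_tokenizer = pre_tok
    tok.train_from_iterator(corpus, trainer=trainer)
    return tok

tokenizers_by_name = {
    "WordPiece": build(models.WordPiece(unk_token="[UNK]"),
                       trainers.WordPieceTrainer(vocab_size=300, special_tokens=["[UNK]"]),
                       pre_tokenizers.Whitespace()),
    "Byte-level BPE": build(models.BPE(),
                            trainers.BpeTrainer(vocab_size=300),
                            pre_tokenizers.ByteLevel()),
    "Unigram (SentencePiece-style)": build(models.Unigram(),
                                           trainers.UnigramTrainer(vocab_size=300,
                                                                   special_tokens=["[UNK]"],
                                                                   unk_token="[UNK]"),
                                           pre_tokenizers.Metaspace()),
}

sample = "المعالجة الآلية للغة العربية"
for name, tok in tokenizers_by_name.items():
    # Byte-level BPE tokens display as byte symbols rather than Arabic script.
    print(name, "->", tok.encode(sample).tokens)
```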

21 pages, 637 KB  
Article
Function-Level Compilation Provenance Identification with Multi-Faceted Neural Feature Distillation and Fusion
by Yang Gao, Lunjin Liang, Yifei Li, Rui Li and Yu Wang
Electronics 2024, 13(9), 1692; https://doi.org/10.3390/electronics13091692 - 27 Apr 2024
Viewed by 1602
Abstract
In the landscape of software development, the selection of compilation tools and settings plays a pivotal role in the creation of executable binaries. This diversity, while beneficial, introduces significant challenges for reverse engineers and security analysts in deciphering the compilation provenance of binary code. To this end, we present MulCPI, short for Multi-representation Fusion-based Compilation Provenance Identification, which integrates the features collected from multiple distinct intermediate representations of the binary code for better discernment of the fine-grained function-level compilation details. In particular, we devise a novel graph-oriented neural encoder improved upon the gated graph neural network by subtly introducing an attention mechanism into the neighborhood nodes’ information aggregation computation, in order to better distill the more informative features from the attributed control flow graph. By further integrating the features collected from the normalized assembly sequence with an advanced Transformer encoder, MulCPI is capable of capturing a more comprehensive set of features manifesting the multi-faceted lexical, syntactic, and structural insights of the binary code. Extensive evaluation on a public dataset comprising 854,858 unique functions demonstrates that MulCPI exceeds the performance of current leading methods in identifying the compiler family, optimization level, compiler version, and the combination of compilation settings. It achieves average accuracy rates of 99.3%, 96.4%, 90.7%, and 85.3% on these tasks, respectively. Additionally, an ablation study highlights the significance of MulCPI’s core designs, validating the efficiency of the proposed attention-enhanced gated graph neural network encoder and the advantages of incorporating multiple code representations. Full article
(This article belongs to the Special Issue Machine Learning (ML) and Software Engineering, 2nd Edition)
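To illustrate the attention-enhanced gated aggregation idea named above, here is a small PyTorch module that weights neighbour messages with learned attention before a GRU-cell state update on a toy control-flow graph. It is an illustrative sketch of the general technique, not MulCPI's actual architecture.

```python
# Attention-weighted neighbour aggregation followed by a gated (GRU) update.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGatedGraphLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.msg = nn.Linear(dim, dim)      # neighbour message transform
        self.att = nn.Linear(2 * dim, 1)    # scores a (node, neighbour) pair
        self.gru = nn.GRUCell(dim, dim)     # gated state update

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (n_nodes, dim) node states; adj: (n_nodes, n_nodes) 0/1 adjacency.
        n = h.size(0)
        m = self.msg(h)
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          m.unsqueeze(0).expand(n, n, -1)], dim=-1)
        scores = self.att(pair).squeeze(-1)                    # (n, n)
        scores = scores.masked_fill(adj == 0, float("-inf"))   # keep real edges only
        alpha = F.softmax(scores, dim=-1)
        alpha = torch.nan_to_num(alpha)                        # nodes with no edges -> 0
        agg = alpha @ m                                        # attention-weighted sum
        return self.gru(agg, h)                                # gated update

# Toy attributed control-flow graph with 4 basic blocks.
h = torch.randn(4, 16)
adj = torch.tensor([[0, 1, 1, 0],
                    [0, 0, 0, 1],
                    [0, 0, 0, 1],
                    [0, 0, 0, 0]], dtype=torch.float32)
layer = AttentionGatedGraphLayer(16)
print(layer(h, adj).shape)  # torch.Size([4, 16])
```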
