Search Results (392)

Search Parameters:
Keywords = corpus construction

22 pages, 1232 KB  
Article
Disaster Emotion: When Media Messages Emphasize Self-Interested Responses
by Soyoung Kim, Christopher Stream and Suyeon Lee
Behav. Sci. 2026, 16(4), 621; https://doi.org/10.3390/bs16040621 - 21 Apr 2026
Viewed by 65
Abstract
Media coverage of disasters frequently frames self-interested behavior in contrast to collective responsibility and coordinated response. This study aims to explore how such behavior is emotionally constructed in disaster-related media, using a carefully selected corpus of 12 text-centered news articles focusing on selfish behavior. The analysis combines transformer-based sentence-level emotion classification using the tweetnlp RoBERTa model, which predicts 11 emotion categories, with Latent Dirichlet Allocation topic modeling across single-sentence and three-sentence windows in a small purposively selected corpus. Emotion–topic relationships are quantified by weighting emotion probabilities by topic distributions and visualized using bar charts, network graphs, and heatmaps. The findings suggest that fear and disgust dominate portrayals of self-interested behavior, while anticipation appears in projections of harm and anger is linked to inequality and institutional accountability. Two discursive configurations emerge: Responsibility Across Individuals and Institutions, emphasizing public accountability and authority, and Collective Fear and Self-Protective Practices, reflecting affect-driven responses under uncertainty. Although negative emotions predominate, optimism appears conditionally, signaling coordination and recovery. Overall, disaster reporting constructs selfishness through integrated emotional–semantic patterns that position individual actions within broader social risk and collective responsibility.
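
To make the quantification step concrete, here is a minimal numpy sketch of weighting emotion probabilities by topic distributions, in the spirit of the abstract; the arrays E and theta and all sizes are invented stand-ins, not the authors' data or code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs (invented, not the study's data):
# E[i, e]     = probability of emotion e for sentence i (11 categories)
# theta[i, k] = LDA topic distribution of sentence i over K topics
n_sentences, n_emotions, n_topics = 200, 11, 6
E = rng.dirichlet(np.ones(n_emotions), size=n_sentences)
theta = rng.dirichlet(np.ones(n_topics), size=n_sentences)

# For each topic, average sentence-level emotion probabilities weighted by
# how strongly each sentence expresses that topic.
weights = theta / theta.sum(axis=0, keepdims=True)  # column-normalize per topic
emotion_by_topic = weights.T @ E                    # (K topics, 11 emotions)

print(emotion_by_topic.round(3))  # each row is an emotion profile for one topic
```

Each row sums to one, so every topic gets a comparable emotion profile that can feed the bar charts and heatmaps the abstract mentions.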
44 pages, 2312 KB  
Article
Classification Model of Emotional Tone in Hate Speech and Its Relationship with Inequality and Gender Stereotypes, Using NLP and Machine Learning Algorithms
by Aymé Escobar Díaz, Ricardo Rivadeneira, Walter Fuertes and Washington Loza
Future Internet 2026, 18(4), 218; https://doi.org/10.3390/fi18040218 - 20 Apr 2026
Viewed by 103
Abstract
Hate speech on social media reproduces norms of inequality and gender stereotypes, disproportionately affecting women. This study proposes a hybrid approach that integrates emotional tone classification with explicit hostility detection to strengthen preventive moderation. We constructed a corpus from three open data sets (1,236,371 records; 1,003,991 after ETL) and represented the text using TF-IDF and contextual RoBERTa embeddings. We trained individual models (RoBERTa fine-tuned, Random Forest, and XGBoost) and a stacking metamodel (Gradient Boosting) that combines their probabilities. On the test set, the ensemble outperformed the base classifiers, achieving accuracy of 0.93 in hate detection and 0.90 in emotion classification, with an AUC of 0.98 for emotion classification. We implemented a RESTful API and a web client to validate the moderation flow before publication, along with an administration panel for auditing. Performance tests in a prototype deployment (Google Colab exposed through an Ngrok tunnel) provided proof-of-concept validation, revealing concurrency limitations from around 300 users due to infrastructure constraints. In general, the results indicate that incorporating emotional tone analysis improves the model’s ability to identify implicit hostility and offers a practical way to promote safer digital environments. The probabilistic outputs produced by the ensemble model were subsequently analyzed using the Bayesian Calibration and Optimal Design under Asymmetric Risk (BACON-AR) framework, which serves as a mathematical post hoc decision layer for evaluating classification behaviour under unequal error costs. Rather than modifying the trained architecture or improving its predictive performance, the framework identifies a cost-sensitive operating threshold that minimizes the total expected risk under the selected asymmetric cost configuration. The experiments were conducted using an English-language data set; therefore, the findings of this study are limited to hate speech detection in English.
(This article belongs to the Section Techno-Social Smart Systems)
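
As a hedged illustration of the ensemble design, the sketch below stacks base classifiers whose predicted probabilities feed a Gradient Boosting meta-learner, then sweeps a cost-sensitive operating threshold. It uses synthetic features in place of the paper's TF-IDF/RoBERTa representations, substitutes sklearn's GradientBoostingClassifier for XGBoost, omits the fine-tuned RoBERTa entirely, and implements only a generic expected-risk sweep rather than the BACON-AR framework itself.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized corpus (invented data).
X, y = make_classification(n_samples=2000, n_features=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Base models feed class probabilities to a Gradient Boosting metamodel.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],  # XGBoost stand-in
    final_estimator=GradientBoostingClassifier(random_state=0),
    stack_method="predict_proba",
)
stack.fit(X_tr, y_tr)
p = stack.predict_proba(X_te)[:, 1]

# Post hoc, cost-sensitive threshold: minimize expected risk when a missed
# positive (FN) is costlier than a false alarm (FP); costs are illustrative.
c_fp, c_fn = 1.0, 5.0
thresholds = np.linspace(0.05, 0.95, 91)
risk = [c_fp * ((p >= t) & (y_te == 0)).sum() + c_fn * ((p < t) & (y_te == 1)).sum()
        for t in thresholds]
print("risk-minimizing threshold:", round(float(thresholds[int(np.argmin(risk))]), 2))
```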

30 pages, 1706 KB  
Article
Understanding the Global Trends of 2025 Through the Defly Compass Methodology
by Mabel López Bordao, Antonia Ferrer Sapena, Carlos A. Reyes Pérez and Enrique A. Sánchez Pérez
Big Data Cogn. Comput. 2026, 10(4), 124; https://doi.org/10.3390/bdcc10040124 - 17 Apr 2026
Viewed by 451
Abstract
This study aims to identify and synthesize the major global trends that shaped 2025 by applying the DeflyCompass methodology to a curated corpus of strategic foresight reports. The study synthesizes insights from 23 strategic reports published by leading international organizations, including the World Economic Forum, Accenture, Euromonitor, and major technology firms. Methodologically, DeflyCompass operationalizes a structured hybrid human–AI pipeline comprising the deployment of multi-agent AI systems, automated knowledge graph construction, semantic clustering, and hybrid human–AI validation processes, reducing an initial set of 816 preliminary signals to a validated catalog of 50 high-priority trends across six PESTEL domains: Political, Economic, Social, Technological, Environmental, and Legal/Governance. Key findings indicate that artificial intelligence functions as a systemic enabling technology across all domains, climate and sustainability imperatives permeate multiple domains, geopolitical fragmentation introduces systemic tension, and trust deficits emerge as a critical vulnerability. The study contributes a replicable and scalable framework for global-level strategic foresight that operationalizes human–AI integration within a rigorous expert-driven validation process, complementing existing hybrid analytical approaches in the literature. Implications extend to decision-making in technology governance, sustainability strategy, social adaptation, and scenario planning, highlighting the necessity of integrating AI augmentation with human expertise for effective future-oriented planning.
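
The abstract's pipeline (multi-agent AI, knowledge graphs, semantic clustering, human validation) belongs to the methodology itself, but the semantic-clustering stage can be sketched generically: embed signal texts and group similar ones into candidate trends for expert review. The signals below are invented, and TF-IDF plus agglomerative clustering stands in for whatever representation the authors actually use.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented trend signals standing in for the 816 preliminary signals.
signals = [
    "AI agents automate enterprise workflows",
    "Generative AI reshapes software development",
    "Carbon pricing expands across major economies",
    "Extreme weather drives climate adaptation spending",
    "Supply chains regionalize amid geopolitical tension",
    "Export controls fragment semiconductor markets",
]

X = TfidfVectorizer().fit_transform(signals).toarray()

# Each cluster is a candidate trend that would then pass through
# hybrid human-AI validation before entering the final catalog.
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
for lab, sig in sorted(zip(labels, signals)):
    print(lab, sig)
```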

23 pages, 944 KB  
Article
When Perception Becomes Discourse: The Case of en/por lo que toca a in Spanish
by Miriam Heila Reyes Núñez
Languages 2026, 11(4), 79; https://doi.org/10.3390/languages11040079 - 15 Apr 2026
Viewed by 576
Abstract
This study examines the diachronic development of Spanish perception verbs into deverbal topic markers (DTMs), focusing on tocar (‘to touch’), e.g., en/por lo que toca a, as representative of sensory perception. While the grammaticalization of visual perception verbs into discourse markers (DMs) has been extensively documented, sensory verbs remain understudied. Drawing on data from three electronic corpora—CORDIAM, CORDE, and CORPES—this paper traces the semantic and syntactic evolution of these constructions from the 15th to the 21st century. There are three main conclusions: (a) the semantic development of tocar (‘to touch’) is driven by the interaction of metonymy and metaphor, corresponding to a process of metaphtonymy; (b) en/por lo que toca a arises through gradual grammaticalization processes, including semantic bleaching, decategorialization, increase in scope, and a positional shift toward the left periphery; (c) the corpus evidence suggests a gradual diffusion of the construction across textual genres, beginning in legal and administrative texts and later spreading to other registers.
(This article belongs to the Special Issue Recent Developments on the Semantics of Perception Verbs)
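
For readers curious what tracing a construction through dated corpora looks like in practice, here is a toy sketch: a regex tallies attestations of en/por lo que toca a by century. The documents are invented; real CORDIAM/CORDE/CORPES work would go through the corpora's own query interfaces or exports.

```python
import re
from collections import Counter

# Toy dated documents (invented); corpus records carry real years and texts.
docs = [
    (1512, "En lo que toca a las rentas del reino, se ordena lo siguiente."),
    (1687, "Por lo que toca a los oficios, el cabildo determinó otra cosa."),
    (1923, "En lo que toca a la educación, el informe es claro."),
]

pattern = re.compile(r"\b(en|por) lo que toca a\b", re.IGNORECASE)

# Tally attestations per century to trace diffusion across periods.
by_century = Counter()
for year, text in docs:
    by_century[f"{(year - 1) // 100 + 1}th c."] += len(pattern.findall(text))

print(dict(by_century))
```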

42 pages, 3547 KB  
Article
Light Verbs and Syntactic Analyzability in the History of the Galician Language
by Alexandre Rodríguez Guerra
Languages 2026, 11(4), 78; https://doi.org/10.3390/languages11040078 - 15 Apr 2026
Viewed by 350
Abstract
This contribution studies the behavior of the four main general light verbs (LVs) in the history of Galician (dar, facer, and haber/ter). The research is structured around the following three fundamental axes: first, we study the evolution, the comparison with equivalent full verbs, and the morphosyntactic behavior of 26 different LV constructions (with examples that the literature identifies with different degrees of fixation) from medieval to contemporary Galician, all of which form a corpus with 8728 occurrences. Next, we discuss the results of a survey distributed to 162 respondents, which allows an assessment of these LVs from several perspectives, especially syntactic. Finally, we offer an original proposal to measure the degree of syntactic analyzability, based on the quantified review of the various parameters analyzed (of which we also provide a scale, applied synchronically and diachronically) and the results in the specific survey question. We call it the Syntactic Analyzability Index (SAI); thanks to this index, we obtain an objective scale that places each example on a gradient reflecting how close a given LV construction stands to freely combined elements at one pole or to the most fixed phrasemes at the other.
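
The abstract does not spell out the SAI formula, so the following is only a plausible reading: score a construction on a set of analyzability parameters and average them into a 0-1 index, where 1 approaches a free combination and 0 a fully fixed phraseme. Every parameter name and value below is hypothetical, not the paper's actual operationalization.

```python
# Hypothetical analyzability parameters for one LV construction, each in
# [0, 1] with 1 = behaves like a free combination (names and values are
# illustrative only).
params = {
    "noun_can_be_modified": 1.0,
    "noun_can_be_pluralized": 0.5,
    "determiner_variation": 0.5,
    "passivization_possible": 0.0,
    "constituent_extraction": 0.0,
}

def sai(scores: dict[str, float]) -> float:
    """Mean of parameter scores: 1.0 = fully analyzable, 0.0 = fully fixed."""
    return sum(scores.values()) / len(scores)

print(f"SAI = {sai(params):.2f}")  # 0.40: partly fixed, partly analyzable
```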

37 pages, 2011 KB  
Review
Quantum-Safe Blockchain: Mapping Research Fronts in Post-Quantum Cryptography, Quantum Threat Models, and QKD Integration
by Félix Díaz, Nhell Cerna, Rafael Liza and Bryan Motta
Computers 2026, 15(4), 240; https://doi.org/10.3390/computers15040240 - 14 Apr 2026
Viewed by 441
Abstract
Quantum computing challenges the long-term security assumptions of blockchain systems that rely on classical public-key cryptography, motivating the adoption of post-quantum cryptography and quantum key distribution (QKD). This review maps research fronts at the intersection of blockchain and quantum-safe security, linking threat assumptions to post-quantum mechanisms, blockchain layers, and QKD positioning. Records were retrieved from Scopus and Web of Science using a two-block query and filtered through a PRISMA-guided workflow for bibliometric mapping. The final corpus comprises 648 journal articles and shows accelerated publication growth after 2023, with scientific production concentrated in a small set of leading countries. Keyword structures indicate that IoT-centric deployments dominate the semantic backbone, where authentication and intelligent methods co-occur with blockchain security primitives, while post-quantum and privacy-preserving constructs form a cohesive technical stream. QKD appears as a distinct but more specialized theme, typically discussed at the system level and shaped by infrastructure and scalability constraints. Overall, the literature is moving from conceptual risk articulation toward engineering integration; however, progress is limited by inconsistent reporting of threat models, post-quantum parameter sets, and ledger-level cost trade-offs, highlighting the need for auditable and reproducible evaluation.
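
A common mechanic behind bibliometric "research front" maps is a keyword co-occurrence network. The sketch below builds one with networkx from invented keyword lists; the actual review works from 648 Scopus/Web of Science records.

```python
from itertools import combinations

import networkx as nx

# Invented author-keyword lists standing in for the 648-article corpus.
records = [
    ["blockchain", "post-quantum cryptography", "IoT"],
    ["blockchain", "authentication", "IoT"],
    ["quantum key distribution", "blockchain", "scalability"],
    ["post-quantum cryptography", "lattice-based signatures", "blockchain"],
]

# Keywords are nodes; every article sharing two keywords adds edge weight.
G = nx.Graph()
for kws in records:
    for a, b in combinations(sorted(set(kws)), 2):
        w = G.get_edge_data(a, b, {"weight": 0})["weight"]
        G.add_edge(a, b, weight=w + 1)

# The heaviest edges sketch the semantic backbone the review describes.
for a, b, d in sorted(G.edges(data=True), key=lambda e: -e[2]["weight"])[:5]:
    print(d["weight"], a, "--", b)
```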

26 pages, 841 KB  
Article
LLM-Assisted Weak Supervision for Low-Resource Kazakh Sequence Labeling: Synthetic Annotation and CRF-Refined NER/POS Models
by Aigerim Aitim
Appl. Sci. 2026, 16(8), 3632; https://doi.org/10.3390/app16083632 - 8 Apr 2026
Viewed by 297
Abstract
Kazakh sequence labeling is constrained by limited annotated resources, while its agglutinative morphology and productive suffixation increase data sparsity and exacerbate label inconsistency in part-of-speech (POS) tagging and named entity recognition (NER). This paper proposes an LLM-assisted weak supervision framework in which a large language model generates synthetic token-level annotations that are subsequently filtered using confidence-based criteria and combined with a smaller manually verified subset to train Transformer-based sequence taggers with Conditional Random Field (CRF) decoding. The pipeline unifies corpus construction, weak-label generation, quality filtering, word-to-subword alignment, and CRF-refined structured prediction into a reproducible workflow. Experimental results show that contextual encoders and structured decoding provide strong performance for Kazakh POS and NER, while the proposed training design enables efficient convergence with diminishing returns beyond moderate epoch budgets. Error-slice analysis indicates that residual errors are concentrated in rare tokens, morphologically complex long words, longer sentences, and the ORG entity class. Overall, the findings support the use of LLM-assisted weak supervision as a scalable strategy for low-resource Kazakh sequence labeling when synthetic labels are controlled through filtering and refined by structured decoding.
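
Two steps of the described pipeline translate directly into code: confidence-based filtering of weak labels, and word-to-subword alignment so one label per word reaches the subword-level tagger. The sketch below uses the standard Hugging Face word_ids() alignment pattern; the weak annotations, the threshold, and the choice of xlm-roberta-base as tokenizer are all assumptions, not the paper's exact setup.

```python
from transformers import AutoTokenizer

# Hypothetical LLM-generated weak annotations: (word, label, confidence).
weak = [("Астана", "B-LOC", 0.97), ("қаласында", "I-LOC", 0.81),
        ("жиын", "O", 0.99), ("өтті", "O", 0.95)]

# 1) Confidence-based filtering: drop sentences whose weakest label is unsure.
MIN_CONF = 0.7  # illustrative threshold
print("sentence kept:", min(c for _, _, c in weak) >= MIN_CONF)

# 2) Word-to-subword alignment: copy each word label to its first subword
# and mask the rest with -100 so the loss/decoder ignores them.
tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
enc = tok([w for w, _, _ in weak], is_split_into_words=True)
labels, prev = [], None
for wid in enc.word_ids():
    labels.append(-100 if wid is None or wid == prev else weak[wid][1])
    prev = wid
print(list(zip(tok.convert_ids_to_tokens(enc["input_ids"]), labels)))
```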

31 pages, 2043 KB  
Systematic Review
Mapping and Auditing Evidence in Digital Storytelling for Industrial Heritage Transformation: A Focused Systematic Review (2011–2026)
by Xin Bian, André Brown and Bruno Marques
Sustainability 2026, 18(7), 3630; https://doi.org/10.3390/su18073630 - 7 Apr 2026
Viewed by 292
Abstract
This study presents a focused review of digital storytelling research in industrial heritage using a bounded Scopus-indexed corpus covering the period from 2011 to February 2026. It examines whether regeneration-relevant interpretive claims in urban renewal contexts are supported by traceable research structures. As post-industrial landscapes undergo restoration and urban redevelopment, digital storytelling is frequently used to frame issues of memory, responsibility, and heritage legitimacy; however, the evidentiary basis of such claims remains insufficiently scrutinized. Adopting an outcome-traceability perspective, the study evaluates whether interpretation-related outcomes are supported by traceable links between mechanisms, constructs, measurement approaches, and evaluation design. A two-stage synthesis is conducted: Stage 1 provides a bibliometric profile of the Scopus-indexed corpus, revealing a fragmented publication landscape dominated by conference papers and prototype-oriented studies, while Stage 2 audits evidence chains across the screened analytical studies to assess whether commonly cited mechanisms, such as narrative meaning-making, affective engagement, and interactive exploration, are operationalised into explicit constructs and measurable indicators. Findings suggest that reported outcomes most frequently concentrate on immediate experiential responses, while higher-level outcomes such as awareness, attitudes, and learning are less consistently supported by robust evaluation designs. The study identifies recurring traceability gaps and outlines priorities for improving evidentiary consistency and comparability in industrial heritage digital storytelling research.
(This article belongs to the Section Social Ecology and Sustainability)
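
An outcome-traceability audit of the kind described can be pictured as a completeness check over coded evidence chains: every link from mechanism to evaluation design must be explicit for a claim to count as traceable. The record and field names below are invented for illustration.

```python
# One screened study's coded evidence chain (fields invented).
study = {
    "mechanism": "affective engagement",
    "construct": "emotional response",
    "measurement": "post-visit questionnaire",
    "evaluation_design": None,  # e.g., no comparison condition reported
}

links = ["mechanism", "construct", "measurement", "evaluation_design"]
missing = [k for k in links if not study[k]]
print("traceable" if not missing else "gap at: " + ", ".join(missing))
```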

26 pages, 1451 KB  
Article
LDA Analysis of Institutional Policy Texts: A Case Study of Regulations on the Protection of Historical and Cultural Cities, Towns, and Villages in China
by Zongcheng Hu and Li Shao
Information 2026, 17(4), 350; https://doi.org/10.3390/info17040350 - 7 Apr 2026
Viewed by 308
Abstract
Against the backdrop of a multi-tiered governance system and increasingly institutionalized norms, China’s historical and cultural preservation policies have long emphasized institutional standardization and hierarchical uniformity. Local policy texts are typically viewed as localized replicas of central institutional logic, overlooking internal variations and differences in information structure. Accordingly, this study examines the Regulations on the Protection of Historical and Cultural Cities, Towns, and Villages issued by 13 provincial-level administrative regions in China. It conceptualizes provincial regulatory texts as institutionalized policy information systems, constructs a cross-regional corpus, and develops a comparative information structure analytical framework based on the Latent Dirichlet Allocation (LDA) topic model. This study operationalizes LDA-derived topic-weight distributions into a comparative analytical framework that captures structural prominence, dispersion, concentration, and priority hierarchy in provincial policy texts. The findings reveal that provincial-level historical and cultural preservation regulations in China exhibit a highly institutionalized information backbone, centered on administrative procedures, legal norms, and macro-level planning controls, and demonstrate significant institutional similarity across provinces. However, within this unified institutional framework, provinces exhibit structural differences in the distribution of thematic weights, information prioritization, and internal textual sequencing, resulting in multiple distinguishable information organization patterns. Consequently, this study highlights the coexistence of formal institutional uniformity and structural differentiation in provincial regulatory texts, providing a more precise basis for understanding variation in local policy expression within China’s historical and cultural governance field.
(This article belongs to the Section Information Theory and Methodology)
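
The structural indicators the study derives from LDA topic weights (dispersion, concentration) have simple textbook forms: Shannon entropy and a Herfindahl-style sum of squares over each text's topic distribution. A minimal sklearn sketch, on invented English toy sentences rather than the Chinese regulation corpus:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Tiny invented stand-ins for provincial regulation texts.
docs = [
    "heritage protection plan approval procedure administrative penalty",
    "historical building repair funding planning control zone",
    "cultural village protection planning approval legal liability",
    "administrative department supervision penalty legal responsibility",
]

X = CountVectorizer().fit_transform(docs)
theta = LatentDirichletAllocation(n_components=3, random_state=0).fit_transform(X)

# Per text: entropy = how dispersed attention is across topics;
# HHI = how concentrated the topic weights are.
entropy = -(theta * np.log(theta)).sum(axis=1)
hhi = (theta ** 2).sum(axis=1)
for i, (h, c) in enumerate(zip(entropy, hhi)):
    print(f"doc {i}: dispersion={h:.2f} concentration={c:.2f}")
```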

23 pages, 1329 KB  
Systematic Review
Knowledge-Informed Technology-Enabled Asset Management and Compliance Assurance in Construction: A Systematic Grey Literature Review
by Alhadi Alsaffar, Thomas Beach and Yacine Rezgui
Buildings 2026, 16(7), 1434; https://doi.org/10.3390/buildings16071434 - 4 Apr 2026
Viewed by 426
Abstract
Digital transformation is reshaping construction asset compliance, but fragmented information and weak evidence trails still constrain effective management. This systematic grey literature review (2014–2025) identifies technologies supporting asset management and compliance assurance and compares adoption maturity across the United Kingdom (UK), the United States (US), Singapore, and the Gulf Cooperation Council (GCC). Using multi-channel search strategies and the AACODS appraisal (Authority, Accuracy, Coverage, Objectivity, Date, Significance), 131 records were identified; 92 full texts reviewed; 82 eligible; and 43 sources retained. Coding identified a recurring five-technology “core digital stack”: Building Information Modelling (BIM), Digital Twins (DT), Internet of Things (IoT), Artificial Intelligence/Machine Learning (AI/ML), and Blockchain (BC). Within the retained corpus, BIM and AI/ML were the most frequently referenced technologies, whereas BC was referenced more selectively and discussed mainly for tamper-evident traceability. DT and IoT were typically discussed alongside BIM, while IoT also frequently co-occurred with AI/ML in analytics-led compliance workflows. A (Region × Technology) maturity matrix suggests higher, policy-led maturity where mandates and audit-ready information align with national frameworks (UK, Singapore), and more uneven, project-led adoption in decentralised contexts (US, GCC). Overall, the findings emphasise that effective compliance relies on integrated, evidence-focused digital stacks supported by standardised information governance rather than isolated tools.
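
The (Region × Technology) matrix is, at bottom, a frequency cross-tabulation of coded sources; a pandas sketch with invented coding records:

```python
import pandas as pd

# Invented coding records: (source_id, region, technology referenced).
codes = [
    (1, "UK", "BIM"), (1, "UK", "AI/ML"), (2, "Singapore", "DT"),
    (2, "Singapore", "BIM"), (3, "US", "IoT"), (3, "US", "AI/ML"),
    (4, "GCC", "BC"), (4, "GCC", "BIM"), (5, "UK", "DT"),
]
df = pd.DataFrame(codes, columns=["source", "region", "technology"])

# Region x Technology counts: the raw material for a maturity assessment.
print(pd.crosstab(df["region"], df["technology"]))
```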

19 pages, 357 KB  
Data Descriptor
Scrabbling Syllables into Words: Wordlikeness Norms for European Portuguese Auditory Pseudowords
by Ana Paula Soares, Alberto Lema, Diana R. Pereira, Ana Cláudia Rodrigues, Vinicius Canonici and Helena M. Oliveira
Data 2026, 11(4), 76; https://doi.org/10.3390/data11040076 - 3 Apr 2026
Viewed by 340
Abstract
Auditory pseudowords are widely used in psycholinguistics and cognitive neuroscience, but their construction requires control of sublexical familiarity and careful characterization of how acoustic cue manipulations may shift perceived lexical plausibility. Here we introduce the Minho Pseudoword Wordlikeness Ratings (MPWR), the first normative dataset of wordlikeness judgments for European Portuguese (EP) auditory trisyllabic CV pseudowords, and evaluate whether adding a localized F0-based prominence cue modulates wordlikeness beyond distributional familiarity. One hundred and twenty pseudowords were assembled from naturally produced syllables drawn from the Minho Spoken Syllable Pool (MSSP) and recorded under uniform conditions. Each item was implemented in three token types with constant segmental content: a flat baseline and two F0-enhanced versions (+15%) targeting either the penultimate or final syllable. Native EP listeners (N = 101) provided wordlikeness ratings on a 7-point scale. MSSP-derived indices quantified pseudoword syllable familiarity (SWI_All, SWI_N3) and stress-position propensity for the targeted syllable (SPP_marked). Ratings were, as intended, low overall yet showed substantial item-to-item variability. F0 enhancement produced a small but reliable decrease in wordlikeness relative to flat tokens, with no reliable difference between penultimate and final targeting positions. SWI_All robustly predicted ratings, whereas SPP_marked added little explanatory value. MPWR provides a practical EP resource for selecting and matching auditory pseudowords using normative wordlikeness ratings and transparent corpus-based descriptors.
(This article belongs to the Section Featured Reviews of Data Science Research)
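
The headline predictive claim (familiarity indices predict wordlikeness) corresponds to a simple item-level regression; here is a toy version with invented data in place of the MPWR norms.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented item-level data: a SWI-style familiarity index and the mean
# wordlikeness rating (1-7) for each of 120 pseudowords.
swi = rng.uniform(0.0, 1.0, size=120)
rating = 1.5 + 2.0 * swi + rng.normal(0.0, 0.4, size=120)

# Least-squares fit: does sublexical familiarity predict ratings?
slope, intercept = np.polyfit(swi, rating, 1)
r = np.corrcoef(swi, rating)[0, 1]
print(f"rating ~= {intercept:.2f} + {slope:.2f} * SWI (r = {r:.2f})")
```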
23 pages, 596 KB  
Article
Perceived Cognitive Assistance in LLM-Augmented Retail Trading: Construct Definition and Content Validation
by Dmitrii Gimmelberg and Iveta Ludviga
Int. J. Financial Stud. 2026, 14(4), 83; https://doi.org/10.3390/ijfs14040083 - 1 Apr 2026
Viewed by 474
Abstract
Large language models (LLMs) are increasingly used by retail traders to interpret information and design complex strategies, yet existing adoption constructs do not capture the decision-time experience of being cognitively scaffolded by an LLM. We define Perceived Cognitive Assistance (PCA) as the trader’s felt expansion of cognitive capability at the moment of a trading decision when an LLM is available, and we report initial content validation of a PCA item pool. Study 1 specified the PCA content domain using a two-tier qualitative corpus (eight interviews and 44 YouTube narratives on LLM-assisted trading, plus 24 qualitative and mixed-method studies on robo-advice and social trading). Reflexive thematic analysis yielded five facilitative assistance facets and one adjacent risk facet (over-reliance), and these were translated into a 16-item PCA pool. Study 2 used a naïve-judge sort-and-rate task with 48 retail traders to test whether items show definitional correspondence to PCA and definitional distinctiveness from similar constructs: perceived usefulness, perceived ease of use, trust in the LLM, and trading self-efficacy. The resulting nine-item set is ready for subsequent factor-analytic and predictive validation. This study advances our understanding of how large language models shape retail trading behaviour by identifying and empirically grounding Perceived Cognitive Assistance as the decision-time psychological experience through which LLMs cognitively scaffold traders, clarifying how LLM use differs from generic technology adoption, trust, or self-efficacy effects.
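
One standard way to score a naïve-judge sort-and-rate task is Anderson and Gerbing's (1991) substantive-validity indices; the sketch below computes them for a single item, without claiming this is the authors' exact procedure. The judge assignments are invented.

```python
from collections import Counter

def substantive_validity(assignments: list[str], intended: str) -> tuple[float, float]:
    """Anderson & Gerbing (1991): p_sa = share of judges assigning the item
    to its intended construct; c_sv = (n_intended - n_largest_other) / N."""
    n = len(assignments)
    counts = Counter(assignments)
    n_c = counts.get(intended, 0)
    n_o = max((v for k, v in counts.items() if k != intended), default=0)
    return n_c / n, (n_c - n_o) / n

# 48 hypothetical judges sorting one candidate PCA item.
judges = ["PCA"] * 38 + ["perceived usefulness"] * 7 + ["trust"] * 3
psa, csv = substantive_validity(judges, "PCA")
print(f"p_sa = {psa:.2f}, c_sv = {csv:.2f}")  # retain items above chosen cutoffs
```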

19 pages, 1855 KB  
Article
Clinically Aligned Long-Context Transformers for Cross-Platform Mental Health Risk Detection
by Aditya Tekale and Mohammad Masum
Electronics 2026, 15(7), 1403; https://doi.org/10.3390/electronics15071403 - 27 Mar 2026
Viewed by 320
Abstract
Social media platforms contain rich but noisy narratives of psychological distress, creating opportunities for early mental health risk detection. However, existing datasets capture heterogeneous constructs such as suicide risk severity, depression diagnosis, and DSM-5 symptom presence, and most prior models are trained and evaluated on a single corpus, limiting their clinical alignment and cross-dataset generalizability. In this study, we fine-tune a domain-specific long-document transformer, AIMH/Mental-Longformer-base-4096, for binary mental health risk detection (risk vs. no risk) using two clinically aligned Reddit datasets: the C-SSRS Reddit corpus and the eRisk 2025 depression dataset. To handle long user histories, we introduce an LLM-based summarization pipeline that compresses posts exceeding 2000 tokens while preserving mental health-relevant information. We also conduct a seven-configuration ablation study across combinations of three corpora (C-SSRS, eRisk, and ReDSM5) to examine how dataset semantics influence model performance. On a held-out C-SSRS + eRisk test set (n = 279), the proposed model achieves a mean balanced accuracy of 0.89 ± 0.01 across five random seeds, with a best run of 0.90 and a 5.74 percentage point improvement over the strongest baseline (TF-IDF + Random Forest). The model also shows strong cross-platform generalization, achieving BA = 0.78 on the depression-reddit-cleaned dataset (n = 7731) and BA = 0.85 (ROC-AUC = 0.92) on a Twitter suicidal-intention dataset (n = 9119) without additional fine-tuning. The ablation analysis shows that although a three-dataset configuration (C-SSRS + eRisk + ReDSM5) maximizes aggregate performance, the ReDSM5 labels encode symptom presence rather than clinical risk, creating a semantic mismatch. This finding highlights the importance of label compatibility when combining heterogeneous mental health corpora. Explainability analysis using Integrated Gradients and attention visualization shows that the model focuses on clinically meaningful expressions such as therapy references, diagnosis, and hopelessness rather than isolated keywords. These results demonstrate that clinically aligned long-context transformers can provide accurate and interpretable mental health risk detection from social media while emphasizing the critical role of dataset semantics in multi-corpus training.
(This article belongs to the Special Issue Role of Artificial Intelligence in Natural Language Processing)
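
The summarization gate is easy to picture: count tokens, and only histories over the threshold go through compression. The sketch below uses the base Longformer tokenizer as a stand-in for the paper's AIMH/Mental-Longformer-base-4096 and a placeholder summarizer where the paper calls an LLM.

```python
from transformers import AutoTokenizer

MAX_TOKENS = 2000  # threshold reported in the abstract

tok = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")

def summarize(text: str) -> str:
    # Placeholder: the paper prompts an LLM to compress long histories
    # while preserving mental health-relevant content.
    return text[:2000]

def prepare(post_history: str) -> str:
    """Summarize only when the history exceeds the token budget."""
    if len(tok.encode(post_history)) > MAX_TOKENS:
        return summarize(post_history)
    return post_history

print(prepare("I have been feeling hopeless since my last therapy session."))
```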

26 pages, 2135 KB  
Article
Mapping Research Trends in Road Safety: A Topic Modeling Perspective
by Iulius Alexandru Tudor and Florin Gîrbacia
Vehicles 2026, 8(4), 69; https://doi.org/10.3390/vehicles8040069 - 27 Mar 2026
Viewed by 573
Abstract
Over the past decade, road safety research has developed rapidly, driven by the expansion of large crash databases, the adoption of artificial intelligence techniques, and the demand for proactive and predictive safety solutions. This study conducts a data-driven review of recent research trends in transport safety. It focuses on main domains including crash severity analysis, human factors, vulnerable road users (VRUs), spatial modeling, and artificial intelligence applications. A systematic search of the Scopus database identified 15,599 relevant scientific papers published between 2016 and 2025. After constructing this corpus, titles, abstracts, and keywords were preprocessed using a natural language processing pipeline. The analysis employed BERTopic, a transformer-based topic modeling framework. It identified 29 distinct research topics, further synthesized into five major thematic areas: (1) crash severity and injury analysis, (2) driver behavior and human factors, (3) vulnerable road users, (4) artificial intelligence, machine learning, and computer vision in intelligent transportation systems, and (5) spatial analysis and hotspot detection. A notable increase in publications related to artificial intelligence and machine learning has been evident since 2020. The results show a transition from descriptive, post-crash studies to integrated, multimodal, predictive analysis. Overall, the findings reveal a paradigm shift in the field. This study also identifies ethical and economic issues associated with the use of artificial intelligence in intelligent transportation systems, including data management, infrastructure requirements, system security, and model transparency. The results signify a transition from intuition-based models to explainable, spatially explicit, and data-intensive models, ultimately facilitating proactive risk assessment and informed decision-making.
(This article belongs to the Special Issue Intelligent Mobility and Sustainable Automotive Technologies)
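
BERTopic's public API makes the core of such a study compact; the quickstart below runs on the 20 Newsgroups corpus as a stand-in for the 15,599 Scopus records (topic modeling needs at least a few thousand documents to be meaningful).

```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# Public corpus standing in for the Scopus titles/abstracts/keywords.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# Transformer embeddings -> dimensionality reduction -> clustering -> topic words.
topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head(10))  # largest discovered topics
```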

21 pages, 340 KB  
Article
(Doing) Computational History: The Role of Data Work in Computational Approaches
by Sarah A. Lang
Histories 2026, 6(2), 26; https://doi.org/10.3390/histories6020026 - 27 Mar 2026
Viewed by 676
Abstract
Computational methods have become increasingly prominent within the historical sciences, generating significant enthusiasm among some scholars. Yet their practical demands, epistemic limits, and ethical implications are less often critically examined than praised. This article explores what it means to do computational history today, arguing that it is not primarily defined by algorithms but by datasets. It is methodologically specific, resource-intensive, selective in scope, labour-heavy, and dependent on pre-digitised sources, specialised infrastructure, and interdisciplinary collaboration. These dependencies limit the scope of research questions and can produce narrow outcomes despite substantial effort, lending some validity to the concern over whether the field yields sufficient historiographical return for the labour invested. Corpus construction and data work lie at the epistemic core of computational history. These often undervalued tasks are not merely technical precursors to analysis, but interpretive and epistemic acts. Data are shaped by digitisation politics, historical bias, and institutional power. They shape the questions asked, the answers produced, and the legitimacy of findings. Recognising and valuing data work is essential, both to embed critical perspectives into computational humanities and to counteract the privileging of certain forms of labour over others. Due to the association of quantification with rigour and scholarly prowess, algorithmic work receives more credit, creating a two-tier system in this division of labour in which those who develop algorithms are elevated above those who curate data, despite their symbiotic interdependence. Computational history, when done well, requires deep engagement with our sources, be they historical documents or datasets. For computational history to stabilise as a meaningful discipline, it must prioritise building better datasets over pursuing increasingly complex algorithms on an unstable basis of data.
(This article belongs to the Section Digital and Computational History)