Search Results (1,342)

Search Parameters:
Keywords = Natural Language Generation

16 pages, 2108 KiB  
Article
Decoding the JAK-STAT Axis in Colorectal Cancer with AI-HOPE-JAK-STAT: A Conversational Artificial Intelligence Approach to Clinical–Genomic Integration
by Ei-Wen Yang, Brigette Waldrup and Enrique Velazquez-Villarreal
Cancers 2025, 17(14), 2376; https://doi.org/10.3390/cancers17142376 (registering DOI) - 17 Jul 2025
Abstract
Background/Objectives: The Janus kinase-signal transducer and activator of transcription (JAK-STAT) signaling pathway is a critical mediator of immune regulation, inflammation, and cancer progression. Although implicated in colorectal cancer (CRC) pathogenesis, its molecular heterogeneity and clinical significance remain insufficiently characterized—particularly within early-onset CRC (EOCRC) and across diverse treatment and demographic contexts. We present AI-HOPE-JAK-STAT, a novel conversational artificial intelligence platform built to enable the real-time, natural language-driven exploration of JAK/STAT pathway alterations in CRC. The platform integrates clinical, genomic, and treatment data to support dynamic, hypothesis-generating analyses for precision oncology. Methods: AI-HOPE-JAK-STAT combines large language models (LLMs), a natural language-to-code engine, and harmonized public CRC datasets from cBioPortal. Users define analytical queries in plain English, which are translated into executable code for cohort selection, survival analysis, odds ratio testing, and mutation profiling. To validate the platform, we replicated known associations involving JAK1, JAK3, and STAT3 mutations. Additional exploratory analyses examined age, treatment exposure, tumor stage, and anatomical site. Results: The platform recapitulated established trends, including improved survival among EOCRC patients with JAK/STAT pathway alterations. In FOLFOX-treated CRC cohorts, JAK/STAT-altered tumors were associated with significantly enhanced overall survival (p < 0.0001). Stratification by age revealed survival advantages in younger (age < 50) patients with JAK/STAT mutations (p = 0.0379). STAT5B mutations were enriched in colon adenocarcinoma and correlated with significantly more favorable trends (p = 0.0000). Conversely, JAK1 mutations in microsatellite-stable tumors did not affect survival, emphasizing the value of molecular context. Finally, JAK3-mutated tumors diagnosed at Stage I–III showed superior survival compared to Stage IV cases (p = 0.00001), reinforcing stage as a dominant clinical determinant. Conclusions: AI-HOPE-JAK-STAT establishes a new standard for pathway-level interrogation in CRC by empowering users to generate and test clinically meaningful hypotheses without coding expertise. This system enhances access to precision oncology analyses and supports the scalable, real-time discovery of survival trends, mutational associations, and treatment-response patterns across stratified patient cohorts. Full article
(This article belongs to the Special Issue AI-Based Applications in Cancers)
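
The cohort comparisons summarized above (e.g., altered vs. wild-type survival, log-rank testing) are standard survival analyses; the following is a minimal sketch of that kind of comparison using the open-source lifelines library. The input file and column names are hypothetical placeholders, not the published platform's interface.

```python
# Minimal sketch of a JAK/STAT-altered vs. wild-type survival comparison,
# assuming the lifelines library; the CSV file and column names are
# illustrative placeholders, not part of AI-HOPE-JAK-STAT itself.
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

df = pd.read_csv("crc_cohort.csv")  # hypothetical: one row per patient
altered = df[df["jak_stat_altered"] == 1]
wildtype = df[df["jak_stat_altered"] == 0]

# Kaplan-Meier estimate for the altered cohort
kmf = KaplanMeierFitter()
kmf.fit(altered["os_months"], event_observed=altered["death_event"], label="JAK/STAT-altered")
print("median OS (months):", kmf.median_survival_time_)

# Log-rank test for a survival difference between the two cohorts
result = logrank_test(
    altered["os_months"], wildtype["os_months"],
    event_observed_A=altered["death_event"],
    event_observed_B=wildtype["death_event"],
)
print(f"log-rank p-value: {result.p_value:.4f}")
```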

18 pages, 957 KiB  
Article
CHTopo: A Multi-Source Large-Scale Chinese Toponym Annotation Corpus
by Peng Ye, Yujin Jiang and Yadi Wang
Information 2025, 16(7), 610; https://doi.org/10.3390/info16070610 - 16 Jul 2025
Abstract
Toponyms are fundamental geographical resources characterized by their spatial attributes, distinct from general nouns. While natural language provides rich toponymic data beyond traditional surveying methods, its qualitative ambiguity and inherent uncertainty challenge systematic extraction. Traditional toponym recognition methods based on part-of-speech tagging only focus on the surface-level features of words, failing to effectively handle complex scenarios such as alias nesting, metonymy ambiguity, and mixed punctuation. This leads to the loss of toponym semantic integrity and deviations in geographic entity recognition. This study proposes a set of Chinese toponym annotation specifications that integrate spatial semantics. By leveraging the XML markup language, it deeply combines the spatial location characteristics of toponyms with linguistic features, and designs fine-grained annotation rules to address the limitations of traditional methods in semantic integrity and geographic entity recognition. On this basis, by integrating multi-source corpora from the Encyclopedia of China: Chinese Geography and People’s Daily, a large-scale Chinese toponym annotation corpus (CHTopo) covering five major categories of toponyms has been constructed. The performance of this annotated corpus was evaluated through toponym recognition, exploring the construction methods of a large-scale, diversified, and high-coverage Chinese toponym annotated corpus from the perspectives of applicability and practicality. CHTopo is conducive to providing foundational support for geographic information extraction, spatial knowledge graphs, and geoparsing research, bridging linguistic and geospatial intelligence. Full article
(This article belongs to the Special Issue Text Mining: Challenges, Algorithms, Tools and Applications)

18 pages, 319 KiB  
Article
Influence of Short Novels on Creation of Educational Programs in Literature: Taking A.P. Chekhov’s “The Chameleon” and Lu Xun’s “A Madman’s Diary” as Examples
by Yuhang Xin and Saule Bayazovna Begaliyeva
Educ. Sci. 2025, 15(7), 906; https://doi.org/10.3390/educsci15070906 (registering DOI) - 16 Jul 2025
Abstract
This study explores how artificial intelligence (AI) technologies can be theoretically integrated into literature curriculum development, using the works of Anton Chekhov and Lu Xun as illustrative case texts. The aim is to reduce barriers to language and cultural understanding in literature education and increase the efficiency and accessibility of cross-cultural teaching. We used natural language processing (NLP) techniques to analyze textual features, such as readability index, lexical density, and syntactic complexity, of AI-generated and human-translated “The Chameleon” and “A Madman’s Diary”. Teaching cases from universities in China, Russia, and Kazakhstan are reviewed to assess the emerging practice of AI-supported literature teaching. The proposed theoretical framework draws on hermeneutics, posthumanism, and cognitive load theories. The results of the data-driven analysis suggest that AI-assisted translation tends to simplify sentence structure and improve surface readability. While anecdotal classroom observations highlight the role of AI in initial comprehension, deeper literary interpretation still relies on teacher guidance and critical human engagement. This study introduces a conceptual “AI Literature Teaching Model” that positions AI as a cognitive and cultural mediator and outlines directions for future empirical validation. Full article
27 pages, 3562 KiB  
Article
Automated Test Generation and Marking Using LLMs
by Ioannis Papachristou, Grigoris Dimitroulakos and Costas Vassilakis
Electronics 2025, 14(14), 2835; https://doi.org/10.3390/electronics14142835 - 15 Jul 2025
Viewed by 177
Abstract
This paper presents an innovative exam-creation and grading system powered by advanced natural language processing and local large language models. The system automatically generates clear, grammatically accurate questions from both short passages and longer documents across different languages, supports multiple formats and difficulty levels, and ensures semantic diversity while minimizing redundancy, thus maximizing the percentage of the material that is covered in the generated exam paper. For grading, it employs a semantic-similarity model to evaluate essays and open-ended responses, awards partial credit, and mitigates bias from phrasing or syntax via named entity recognition. A major advantage of the proposed approach is its ability to run entirely on standard personal computers, without specialized artificial intelligence hardware, promoting privacy and exam security while maintaining low operational and maintenance costs. Moreover, its modular architecture allows the seamless swapping of models with minimal intervention, ensuring adaptability and the easy integration of future improvements. A requirements–compliance evaluation, combined with established performance metrics, was used to review and compare two popular multilingual LLMs and monolingual alternatives, demonstrating the system’s effectiveness and flexibility. The experimental results show that the system achieves a grading accuracy within a 17% normalized error margin compared to that of human experts, with generated questions reaching up to 89.5% semantic similarity to source content. The full exam generation and grading pipeline runs efficiently on consumer-grade hardware, with average inference times under 30 s. Full article
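
The grading step described above rests on semantic similarity between a reference answer and a student response. As a rough illustration only, the sketch below scores a response with the sentence-transformers library and maps the similarity onto partial credit; the model name, answers, and credit scaling are assumptions, not the paper's configuration.

```python
# Illustrative semantic-similarity grading with partial credit, assuming the
# sentence-transformers library; model, texts, and the 0-10 scaling are
# placeholders rather than the system evaluated in the paper.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

reference = "Photosynthesis converts light energy into chemical energy stored in glucose."
student = "Plants use light to make glucose, storing the energy chemically."

emb_ref, emb_student = model.encode([reference, student], convert_to_tensor=True)
similarity = util.cos_sim(emb_ref, emb_student).item()  # value in [-1, 1]

# Map similarity onto a 0-10 score with partial credit (illustrative scaling)
score = round(max(0.0, similarity) * 10, 1)
print(f"similarity={similarity:.3f}, awarded score={score}/10")
```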

21 pages, 1118 KiB  
Review
Integrating Large Language Models into Robotic Autonomy: A Review of Motion, Voice, and Training Pipelines
by Yutong Liu, Qingquan Sun and Dhruvi Rajeshkumar Kapadia
AI 2025, 6(7), 158; https://doi.org/10.3390/ai6070158 - 15 Jul 2025
Viewed by 224
Abstract
This survey provides a comprehensive review of the integration of large language models (LLMs) into autonomous robotic systems, organized around four key pillars: locomotion, navigation, manipulation, and voice-based interaction. We examine how LLMs enhance robotic autonomy by translating high-level natural language commands into low-level control signals, supporting semantic planning and enabling adaptive execution. Systems like SayTap improve gait stability through LLM-generated contact patterns, while TrustNavGPT achieves a 5.7% word error rate (WER) under noisy voice-guided conditions by modeling user uncertainty. Frameworks such as MapGPT, LLM-Planner, and 3D-LOTUS++ integrate multi-modal data—including vision, speech, and proprioception—for robust planning and real-time recovery. We also highlight the use of physics-informed neural networks (PINNs) to model object deformation and support precision in contact-rich manipulation tasks. To bridge the gap between simulation and real-world deployment, we synthesize best practices from benchmark datasets (e.g., RH20T, Open X-Embodiment) and training pipelines designed for one-shot imitation learning and cross-embodiment generalization. Additionally, we analyze deployment trade-offs across cloud, edge, and hybrid architectures, emphasizing latency, scalability, and privacy. The survey concludes with a multi-dimensional taxonomy and cross-domain synthesis, offering design insights and future directions for building intelligent, human-aligned robotic systems powered by LLMs. Full article

21 pages, 1620 KiB  
Article
Guiding the Unseen: A Systems Model of Prompt-Driven Agency Dynamics in Generative AI-Enabled VR Serious Game Design
by Chenhan Jiang, Shengyu Huang and Tao Shen
Systems 2025, 13(7), 576; https://doi.org/10.3390/systems13070576 - 12 Jul 2025
Viewed by 263
Abstract
Generative Artificial Intelligence (GenAI)-assisted Virtual Reality (VR) heritage serious game design constitutes a complex adaptive socio-technical system in which natural language prompts act as control levers shaping designers’ cognition and action. However, the systemic effects of prompt type on agency construction, decision boundaries, and process strategy remain unclear. Treating the design setting as adaptive, we captured real-time interactions by collecting think-aloud data from 48 novice designers. Nine prompt categories were extracted and their cognitive effects were systematically analyzed through the Repertory Grid Technique (RGT), principal component analysis (PCA), and Ward clustering. These analyses revealed three perception profiles: tool-based, collaborative, and mentor-like. Strategy coding of 321 prompt-aligned utterances showed cluster-specific differences in path length, first moves, looping, and branching. Tool-based prompts reinforced boundary control through short linear refinements; collaborative prompts sustained moderate iterative enquiry cycles; mentor-like prompts triggered divergent exploration via self-loops and frequent jumps. We therefore propose a stage-adaptive framework that deploys mentor-like prompts for ideation, collaborative prompts for mid-phase iteration, and tool-based prompts for final verification. This approach balances creativity with procedural efficiency and offers a reusable blueprint for integrating prompt-driven agency modelling into GenAI design workflows. Full article
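
The PCA-then-Ward-clustering step mentioned above is a common pattern for deriving profiles from rating grids. Below is a minimal sketch of that step on a synthetic designer-by-prompt-category matrix; the data, component count, and three-cluster cut are assumptions for illustration, not the study's actual repertory-grid analysis.

```python
# Minimal sketch of PCA followed by Ward clustering on a synthetic rating
# matrix (48 designers x 9 prompt categories); all values are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(48, 9)).astype(float)  # synthetic grid ratings

# Reduce the rating space to its principal components
pca = PCA(n_components=2)
components = pca.fit_transform(ratings)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Ward hierarchical clustering on the component scores, cut into 3 profiles
Z = linkage(components, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")
print("cluster sizes:", np.bincount(labels)[1:])
```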

450 KiB  
Proceeding Paper
Methodology for Automatic Information Extraction and Summary Generation from Online Sources for Project Funding
by Mariya Zhekova
Eng. Proc. 2025, 100(1), 44; https://doi.org/10.3390/engproc2025100044 (registering DOI) - 11 Jul 2025
Abstract
The summarized content of one or more extensive text documents helps users extract only the most important key information, instead of reviewing and reading hundreds of pages of text. This study uses extractive and abstractive mechanisms to automatically extract and summarize information retrieved from various web documents on the same topic. The research aims to develop a methodology for designing and developing an information system for pre- and post-processing natural language obtained through web content search and web scraping, and for the automatic generation of a summary of the retrieved text. The research outlines two subtasks. As a first step, the system is designed to collect and process up-to-date information based on specific criteria from diverse web resources related to project funding, initiated by various organizations such as startups, sustainable companies, municipalities, government bodies, schools, the NGO sector, and others. As a second step, the collected extensive textual information about current projects and programs, which is typically intended for financial professionals, is to be summarized into a shorter version and transformed into a suitable format for a wide range of non-specialist users. The automated AI software tool, which will be developed using the proposed methodology, will be able to crawl and read project funding information from various web documents, select, process, and prepare a shortened version containing only the most important key information for its clients. Full article
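
The extractive half of such a pipeline can be illustrated with a toy sentence-scoring summarizer; the frequency-based scoring below is a deliberate simplification and not the methodology proposed in the paper.

```python
# Toy frequency-based extractive summarizer: score sentences by average word
# frequency and keep the top ones in document order. Illustrative only.
import re
from collections import Counter

def extractive_summary(text: str, max_sentences: int = 3) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(sentences, key=score, reverse=True)[:max_sentences]
    return " ".join(s for s in sentences if s in ranked)

sample = ("The Horizon programme funds municipalities for energy retrofits. "
          "Applications close in March and require a co-financing plan. "
          "A separate strand supports schools and NGOs with smaller grants.")
print(extractive_summary(sample, max_sentences=2))
```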

16 pages, 2741 KiB  
Article
EVOCA: Explainable Verification of Claims by Graph Alignment
by Carmela De Felice, Carmelo Fabio Longo, Misael Mongiovì, Daniele Francesco Santamaria and Giusy Giulia Tuccari
Information 2025, 16(7), 597; https://doi.org/10.3390/info16070597 - 11 Jul 2025
Viewed by 162
Abstract
The paper introduces EVOCA—Explainable Verification Of Claims by Graph Alignment—a hybrid approach that combines NLP (Natural Language Processing) techniques with the structural advantages of knowledge graphs to manage and reduce the amount of evidence required to evaluate statements. The approach leverages the explicit and interpretable structure of semantic graphs, which naturally represent the semantic structure of a sentence—or a set of sentences—and explicitly encodes the relationships among different concepts, thereby facilitating the extraction and manipulation of relevant information. The primary objective of the proposed tool is to condense the evidence into a short sentence that preserves only the salient and relevant information of the target claim. This process eliminates superfluous and redundant information, which could impact the performance of the subsequent verification task and provide useful information to explain the outcome. To achieve this, the proposed tool called EVOCA—Explainable Verification Of Claims by Graph Alignment—generates a sub-graph in AMR (Abstract Meaning Representation), representing the tokens of the claim–evidence pair that exhibit high semantic similarity. The structured representation offered by the AMR graph not only aids in identifying the most relevant information but also improves the interpretability of the results. The resulting sub-graph is converted back into natural language with the SPRING AMR tool, producing a concise but meaning-rich “sub-evidence” sentence. The output can be processed by lightweight language models to determine whether the evidence supports, contradicts, or is neutral about the claim. The approach is tested on the 4297 sentence pairs of the Climate-BERT-fact-checking dataset, and the promising results are discussed. Full article

18 pages, 797 KiB  
Article
A Digital Sustainability Lens: Investigating Medical Students’ Adoption Intentions for AI-Powered NLP Tools in Learning Environments
by Mostafa Aboulnour Salem
Sustainability 2025, 17(14), 6379; https://doi.org/10.3390/su17146379 - 11 Jul 2025
Viewed by 258
Abstract
This study investigates medical students’ intentions to adopt AI-powered Natural Language Processing (NLP) tools (e.g., ChatGPT, Copilot) within educational contexts aligned with the perceived requirements of digital sustainability. Based on the Unified Theory of Acceptance and Use of Technology (UTAUT), data were collected from 301 medical students in Saudi Arabia and analyzed using Partial Least Squares Structural Equation Modelling (PLS-SEM). The results indicate that Performance Expectancy (PE) (β = 0.65), Effort Expectancy (EE) (β = 0.58), and Social Influence (SI) (β = 0.53) collectively and significantly predict Behavioral Intention (BI), explicating 62% of the variance in BI (R2 = 0.62). AI awareness did not significantly influence students’ responses or the relationships among constructs, possibly because practical familiarity and widespread exposure to AI-NLP tools exert a stronger influence than general awareness. Moreover, BI exhibited a strong positive effect on perceptions of digital sustainability (PDS) (β = 0.72, R2 = 0.51), highlighting a meaningful link between AI adoption and sustainable digital practices. Consequently, these findings indicate the strategic role of AI-driven NLP tools as both educational innovations and key enablers of digital sustainability, aligning with global frameworks such as the Sustainable Development Goals (SDGs) 4 and 9. The study also concerns AI’s transformative potential in medical education and recommends further research, particularly longitudinal studies, to better understand the evolving impact of AI awareness on students’ adoption behaviours. Full article

40 pages, 7773 KiB  
Article
A Novel Llama 3-Based Prompt Engineering Platform for Textual Data Generation and Labeling
by Wedyan Salem Alsakran and Reham Alabduljabbar
Electronics 2025, 14(14), 2800; https://doi.org/10.3390/electronics14142800 - 11 Jul 2025
Viewed by 240
Abstract
With the growing demand for labeled textual data in Natural Language Processing (NLP), traditional data collection and annotation methods face significant challenges, such as high cost, limited scalability, and privacy constraints. This study presents a novel web-based platform that automates text data generation and labeling by integrating Llama 3.3, an open-source large language model (LLM), with advanced prompt engineering techniques. A core contribution of this work is the Attributed Prompt Engineering Framework, which enables modular and configurable prompt templates for both data generation and labeling tasks. This framework combines zero-shot, few-shot, role-based, and chain-of-thought prompting strategies within a unified architecture to optimize output quality and control. Users can interactively configure prompt parameters and generate synthetic datasets or annotate raw data with minimal human intervention. We evaluated the platform using both benchmark datasets (AG News, Yelp, Amazon Reviews) and two fully synthetic datasets we generated (restaurant reviews and news articles). The system achieved 99% accuracy and F1-score on generated news article data, 98% accuracy and F1-score on generated restaurant review data, and 92%, 90%, and 89% accuracy and F1-scores on the benchmark labeling tasks for AG News, Yelp Reviews, and Amazon Reviews, respectively, demonstrating high effectiveness and generalizability. A usability study also confirmed the platform’s practicality for non-expert users. This work advances scalable NLP data pipeline design and provides a cost-effective alternative to manual annotation for supervised learning applications. Full article
(This article belongs to the Special Issue Advanced Natural Language Processing Technology and Applications)
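
To make the idea of a configurable labeling prompt concrete, the sketch below combines role-based, few-shot, and chain-of-thought elements in the spirit of the framework described above; the template fields and example reviews are hypothetical, not the platform's own templates.

```python
# Sketch of a configurable labeling prompt mixing role-based, few-shot, and
# chain-of-thought cues; all fields and examples are hypothetical.
FEW_SHOT_EXAMPLES = [
    ("The pasta was cold and the staff ignored us.", "negative"),
    ("Great service and the ramen was outstanding.", "positive"),
]

def build_labeling_prompt(text: str, labels: list[str]) -> str:
    shots = "\n".join(f'Review: "{r}"\nLabel: {l}' for r, l in FEW_SHOT_EXAMPLES)
    return (
        "You are an expert data annotator for restaurant reviews.\n"  # role-based
        f"Assign exactly one label from {labels} to the review.\n"
        "Think step by step, then answer with the label only.\n\n"    # chain-of-thought cue
        f"{shots}\n\n"                                                # few-shot examples
        f'Review: "{text}"\nLabel:'
    )

print(build_labeling_prompt("Portions were tiny but the dessert saved the evening.",
                            ["positive", "negative"]))
```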

23 pages, 2718 KiB  
Article
Chinese Tourist Motivations for Hokkaido, Japan: A Hybrid Approach Using Transformer Models and Statistical Methods
by Zhenzhen Liu, Juuso Eronen, Fumito Masui and Michal Ptaszynski
Tour. Hosp. 2025, 6(3), 133; https://doi.org/10.3390/tourhosp6030133 - 11 Jul 2025
Viewed by 223
Abstract
The COVID-19 pandemic severely impacted Japan’s inbound tourism, but recent recovery trends highlight the growing importance of Chinese tourists. Understanding their motivations is crucial for revitalizing the industry. Building on our previous framework, this study applies Transformer-based natural language processing (NLP) models and principal component analysis (PCA) to analyze large-scale user-generated content (UGC) and identify key motivational factors influencing Chinese tourists’ visits to Hokkaido. Traditional survey-based approaches to tourism motivation research often suffer from response biases and small sample sizes. In contrast, we leverage a pre-trained Transformer model, RoBERTa, to score motivational factors like self-expansion, excitement, and cultural observation. PCA is subsequently used to extract the most significant factors across different destinations. Findings indicate that Chinese tourists are primarily drawn to Hokkaido’s natural scenery and cultural experiences, and the differences in these factors by season. While the model effectively aligns with manual scoring, it shows limitations in capturing more abstract motivations such as excitement and self-expansion. This research advances tourism analytics by applying AI-driven methodologies, offering practical insights for destination marketing and management. Future work can extend this approach to other regions and cross-cultural contexts, further enhancing AI’s role in understanding evolving traveler preferences. Full article

28 pages, 549 KiB  
Review
Large Language Models for Knowledge Graph Embedding: A Survey
by Bingchen Liu, Yuanyuan Fang, Naixing Xu, Shihao Hou, Xin Li and Qian Li
Mathematics 2025, 13(14), 2244; https://doi.org/10.3390/math13142244 - 10 Jul 2025
Viewed by 256
Abstract
Large language models (LLMs) have attracted a lot of attention in various fields due to their superior performance, aiming to train hundreds of millions or more parameters on large amounts of text data to understand and generate natural language. As the superior performance of LLMs becomes apparent, they are increasingly being applied to knowledge graph embedding (KGE)-related tasks to improve the processing results. Traditional KGE representation learning methods map entities and relations into a low-dimensional vector space, enabling the triples in the knowledge graph to satisfy a specific scoring function in the vector space. However, based on the powerful language understanding and semantic modeling capabilities of LLMs, which have recently been invoked to varying degrees in different types of KGE-related scenarios such as multi-modal KGE and open KGE according to their task characteristics, researchers are increasingly exploring how to integrate LLMs to enhance knowledge representation, improve generalization to unseen entities or relations, and support reasoning beyond static graph structures. In this paper, we investigate a wide range of approaches for performing LLMs-related tasks in different types of KGE scenarios. To better compare the various approaches, we summarize each KGE scenario in a classification. In the article we also discuss the applications in which the methods are mainly used and suggest several forward-looking directions for the development of this new research area. Full article
(This article belongs to the Special Issue Data-Driven Decentralized Learning for Future Communication Networks)
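
The "specific scoring function" that traditional KGE methods optimize can be illustrated with TransE, one classical example in this family: a triple (h, r, t) is scored by how closely the relation vector translates the head embedding onto the tail, i.e. -||h + r - t||. The embeddings below are random placeholders rather than trained vectors.

```python
# TransE-style scoring sketch: plausible triples should have h + r close to t.
# Embeddings here are random placeholders, not trained KGE vectors.
import numpy as np

rng = np.random.default_rng(42)
dim = 50
entity_emb = {name: rng.normal(size=dim) for name in ["Paris", "France", "Tokyo"]}
relation_emb = {"capital_of": rng.normal(size=dim)}

def transe_score(head: str, relation: str, tail: str) -> float:
    h, r, t = entity_emb[head], relation_emb[relation], entity_emb[tail]
    return -np.linalg.norm(h + r - t)  # higher (less negative) = more plausible

print(transe_score("Paris", "capital_of", "France"))
print(transe_score("Tokyo", "capital_of", "France"))
```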

23 pages, 1621 KiB  
Article
Analyzing Higher Education Students’ Prompting Techniques and Their Impact on ChatGPT’s Performance: An Exploratory Study in Spanish
by José Luis Carrasco-Sáez, Carolina Contreras-Saavedra, Sheny San-Martín-Quiroga, Carla E. Contreras-Saavedra and Rhoddy Viveros-Muñoz
Appl. Sci. 2025, 15(14), 7651; https://doi.org/10.3390/app15147651 - 8 Jul 2025
Viewed by 570
Abstract
Generative artificial intelligence is reshaping how people interact with digital technologies, emphasizing the need to develop effective skills for engaging with it. In this context, prompt engineering has emerged as a critical skill for optimizing AI-generated outputs. However, research on how higher education students interact with these technologies remains limited, particularly in non-English-speaking contexts. This exploratory study examines how 102 higher education students in Chile formulated prompts in Spanish and how their techniques influenced the responses generated by ChatGPT (free version 3.5). A quantitative analysis was conducted to assess the relationship between prompt techniques and response quality. Two emergent prompt engineering strategies were identified: the Guide Contextualization Strategy and the Specific Purpose Strategy. The Guide Contextualization Strategy focused on providing explicit contextual information to guide ChatGPT’s responses, aligning with few-shot prompting, while the Specific Purpose Strategy emphasized defining the request’s purpose, aligning with structured objective formulation strategies. The regression analysis indicated that the Guide Contextualization Strategy had a greater impact on response quality, reinforcing the importance of contextual information in effective interactions with large language models. As an exploratory study, these findings provide preliminary evidence on prompt engineering strategies in Spanish, a relatively unexplored area in artificial intelligence education research. Based on these results, a methodological framework is proposed, encompassing four key dimensions: grammatical skills; prompt strategies; response from the large language model; and evaluation of response quality. This framework lays the groundwork for future artificial intelligence digital literacy interventions, fostering critical and effective engagement with generative artificial intelligence while also highlighting the need for further research to validate and expand these initial insights. Full article
(This article belongs to the Special Issue Techniques and Applications of Natural Language Processing)

22 pages, 796 KiB  
Article
BIMCoder: A Comprehensive Large Language Model Fusion Framework for Natural Language-Based BIM Information Retrieval
by Bingru Liu and Hainan Chen
Appl. Sci. 2025, 15(14), 7647; https://doi.org/10.3390/app15147647 - 8 Jul 2025
Viewed by 181
Abstract
Building Information Modeling (BIM) has excellent potential to enhance building operation and maintenance. However, as a standardized data format in the architecture, engineering, and construction (AEC) industry, the retrieval of BIM information generally requires specialized software. Cumbersome software operations prevent its effective application in the actual operation and management of buildings. This paper presents BIMCoder, a model designed to translate natural language queries into structured query statements compatible with professional BIM software (e.g., BIMserver v1.5). It serves as an intermediary component between users and various BIM platforms, facilitating access for users without specialized BIM knowledge. A dedicated BIM information query dataset was constructed, comprising 1680 natural language query and structured BIM query string pairs, categorized into 12 groups. Three classical pre-trained large language models (LLMs) (ERNIE 3.0, Llama-13B, and SQLCoder) were evaluated on this dataset. A fine-tuned model based on SQLCoder was then trained. Subsequently, a fusion model (BIMCoder) integrating ERNIE and SQLCoder was designed. Test results demonstrate that the proposed BIMCoder model achieves an outstanding accurate matching rate of 87.16% and an Execution Accuracy rate of 88.75% for natural language-based BIM information retrieval. This study confirms the feasibility of natural language-based BIM information retrieval and offers a novel solution to reduce the complexity of BIM system interaction. Full article

39 pages, 8177 KiB  
Article
Unveiling Epigenetic Regulatory Elements Associated with Breast Cancer Development
by Marta Jardanowska-Kotuniak, Michał Dramiński, Michal Wlasnowolski, Marcin Łapiński, Kaustav Sengupta, Abhishek Agarwal, Adam Filip, Nimisha Ghosh, Vera Pancaldi, Marcin Grynberg, Indrajit Saha, Dariusz Plewczynski and Michał J. Dąbrowski
Int. J. Mol. Sci. 2025, 26(14), 6558; https://doi.org/10.3390/ijms26146558 - 8 Jul 2025
Viewed by 428
Abstract
Breast cancer affects over 2 million women annually and results in 650,000 deaths. This study aimed to identify epigenetic mechanisms impacting breast cancer-related gene expression, discover potential biomarkers, and present a novel approach integrating feature selection, Natural Language Processing, and 3D chromatin structure analysis. We used The Cancer Genome Atlas database with over 800 samples and multi-omics datasets (mRNA, miRNA, DNA methylation) to select 2701 features statistically significant in cancer versus control samples, from an initial 417,486, using the Monte Carlo Feature Selection and Interdependency Discovery algorithm. Classification of cancer vs. control samples on the selected features returned very high accuracy, depending on feature-type and classifier. The cancer samples generally showed lower expression of differentially expressed genes (DEGs) and increased β-values of differentially methylated sites (DMSs). We identified mRNAs whose expression is explained by miRNA expression and β-values of DMSs. We recognized DMSs affecting NRF1 and MXI1 transcription factors binding, causing a disturbance in NKAPL and PITX1 expression, respectively. Our 3D models showed more loosely packed chromatin in cancer. This study highlights numerous possible regulatory dependencies, and the presented bioinformatic approach provides a robust framework for data dimensionality reduction, enabling the identification of key features for further experimental validation. Full article
(This article belongs to the Section Molecular Oncology)
