This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
Article

Prediction of Protein‒DNA Interface Hot Spots Based on Empirical Mode Decomposition and Machine Learning

1 School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei 230036, China
2 Anhui Provincial Engineering Laboratory for Beidou Precision Agriculture Information, Anhui Agricultural University, Hefei 230036, China
3 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei 230088, China
* Author to whom correspondence should be addressed.
Genes 2024, 15(6), 676; https://doi.org/10.3390/genes15060676
Submission received: 27 April 2024 / Revised: 20 May 2024 / Accepted: 22 May 2024 / Published: 23 May 2024
(This article belongs to the Section Bioinformatics)

Abstract

Protein–DNA interactions play a crucial role in biological activities such as gene expression, modification, replication and transcription. Precise identification of hot spots at protein–DNA binding interfaces is essential both for understanding their physiological significance and for advancing computational biology. In this paper, we propose a hot spot prediction method called EC-PDH. First, we extracted conventional features describing the solid solvent-accessible surface area (ASA) and secondary structure of these hot spots. We then applied the empirical mode decomposition (EMD) algorithm to these conventional features and extracted the mean, variance, energy and autocorrelation function values of the first three intrinsic mode functions (IMFs) as new features, yielding 218 feature dimensions in total. For feature selection, we used the minimum redundancy maximum relevance method combined with sequential forward selection (mRMR-SFS) to obtain an optimal 11-dimensional feature subset. To address data imbalance, we used the SMOTE-Tomek algorithm to balance the positive and negative samples, and we then used the CatBoost gradient boosting algorithm to construct the hot spot prediction model for protein–DNA binding interfaces. On the test set, EC-PDH achieves AUC, MCC and F1 score values of 0.847, 0.543 and 0.772, respectively. In a comparative evaluation, EC-PDH outperforms existing state-of-the-art methods in identifying hot spots.
Keywords: hot spots; protein‒DNA; EMD; CatBoost
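
The per-IMF statistics named in the abstract (mean, variance, energy and an autocorrelation value for each of the first three IMFs) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `imf_statistics` and `emd_feature_vector` are hypothetical, the IMFs are assumed to have been computed beforehand by some EMD routine, and the use of population variance and a lag-1 normalised autocorrelation are assumptions, since the abstract does not specify either choice.

```python
from typing import Dict, List, Sequence


def imf_statistics(imf: Sequence[float], lag: int = 1) -> Dict[str, float]:
    """Summarise one IMF by mean, variance, energy and autocorrelation.

    Assumptions (not stated in the abstract): population variance and a
    normalised autocorrelation evaluated at a single lag (default 1).
    """
    n = len(imf)
    if n <= lag:
        raise ValueError("IMF must be longer than the autocorrelation lag")

    mean = sum(imf) / n
    centered = [x - mean for x in imf]
    sum_sq_dev = sum(c * c for c in centered)

    variance = sum_sq_dev / n            # population variance
    energy = sum(x * x for x in imf)     # sum of squared amplitudes

    # Normalised autocorrelation at the chosen lag; set to 0 for a constant signal.
    if sum_sq_dev == 0:
        autocorr = 0.0
    else:
        lagged = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        autocorr = lagged / sum_sq_dev

    return {"mean": mean, "variance": variance, "energy": energy, "autocorr": autocorr}


def emd_feature_vector(
    imfs: Sequence[Sequence[float]], n_imfs: int = 3, lag: int = 1
) -> List[float]:
    """Concatenate the four statistics of the first `n_imfs` IMFs into one vector."""
    features: List[float] = []
    for imf in imfs[:n_imfs]:
        stats = imf_statistics(imf, lag)
        features.extend(
            [stats["mean"], stats["variance"], stats["energy"], stats["autocorr"]]
        )
    return features
```

Under these assumptions, each conventional feature signal contributes up to 12 derived values (four statistics for each of three IMFs). How these combine into the 218 dimensions reported above is not described in the abstract, so the sketch makes no claim about it.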

Share and Cite

MDPI and ACS Style

Fang, Z.; Li, Z.; Li, M.; Yue, Z.; Li, K. Prediction of Protein‒DNA Interface Hot Spots Based on Empirical Mode Decomposition and Machine Learning. Genes 2024, 15, 676. https://doi.org/10.3390/genes15060676
