A Novel Sequence-Based Feature for the Identification of DNA-Binding Sites in Proteins Using Jensen-Shannon Divergence
2016 | journal article. A publication with affiliation to the University of Göttingen.
Details
- Authors
- Dang, Truong Khanh Linh; Meckbach, Cornelia; Tacke, Rebecca; Waack, Stephan; Gueltas, Mehmet
- Abstract
- Knowledge of protein-DNA interactions is essential for fully understanding the molecular activities of life. Many research groups have developed tools, based on either structure or sequence, to predict the DNA-binding residues in proteins. Structure-based methods usually achieve good results but require knowledge of the protein's 3D structure, whereas sequence-based methods can be applied to proteins at high throughput but require good features. In this study, we present a new information-theoretic feature derived from the Jensen-Shannon Divergence (JSD) between the amino acid distribution of a site and the background distribution of non-binding sites. Our new feature indicates how much a given site differs from a non-binding site, and is therefore informative for detecting binding sites in proteins. We conduct the study with five-fold cross-validation on 263 proteins using the Random Forest classifier. We evaluate our new features by combining them with other popular existing features such as the position-specific scoring matrix (PSSM), orthogonal binary vector (OBV), and secondary structure (SS). We find that adding our features significantly boosts the performance of the Random Forest classifier, with a clear increase in sensitivity and Matthews correlation coefficient (MCC).
- Issue Date
- 2016
- Status
- published
- Publisher
- MDPI AG
- Journal
- Entropy
- ISSN
- 1099-4300
- Sponsor
- Open-Access-Publikationsfonds 2016
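
The abstract describes a feature built from the Jensen-Shannon Divergence between the amino acid distribution of a site and a background distribution. The following is a minimal sketch of that general idea, not the authors' implementation: the function names, the pseudocount smoothing, and the uniform background used in the usage lines are all assumptions made for illustration. In the paper the background is estimated from non-binding sites, and the exact weighting may differ.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_distribution(column, pseudocount=1.0):
    """Estimate the amino acid distribution of one alignment column.

    `column` is a string of residues observed at a site; characters outside
    the 20 standard residues (e.g. gaps) are ignored. The pseudocount is an
    illustrative smoothing choice, not taken from the paper.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts[a] + pseudocount) / total for a in AMINO_ACIDS]


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Equal-weight Jensen-Shannon divergence in bits; lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical usage: a uniform background stands in for the
# non-binding-site background distribution described in the abstract.
background = [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
site = column_distribution("KKKRKKRK")
score = jensen_shannon_divergence(site, background)
```

A larger `score` means the site's residue distribution departs more strongly from the background, which is the property the abstract says makes the feature informative for detecting binding sites.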