PMID- 33297954
OWN - NLM
STAT- MEDLINE
DCOM- 20201223
LR  - 20201223
IS  - 1471-2105 (Electronic)
IS  - 1471-2105 (Linking)
VI  - 21
IP  - 1
DP  - 2020 Dec 9
TI  - iDPGK: characterization and identification of lysine phosphoglycerylation sites based on sequence-based features.
PG  - 568
LID - 10.1186/s12859-020-03916-5 [doi]
LID - 568
AB  - BACKGROUND: Protein phosphoglycerylation, the addition of 1,3-bisphosphoglyceric acid (1,3-BPG) to a lysine residue of a protein to form 3-phosphoglyceryl-lysine, is a reversible, non-enzymatic post-translational modification (PTM) that plays a regulatory role in glucose metabolism and the glycolytic process. As the number of experimentally verified phosphoglycerylated sites has increased significantly, statistical or machine learning methods are imperative for investigating the characteristics of phosphoglycerylation sites. Currently, research into phosphoglycerylation is very limited, and only a few resources are available for the computational identification of phosphoglycerylation sites. RESULT: We present a bioinformatics investigation of phosphoglycerylation sites based on sequence-based features. The TwoSampleLogo analysis reveals that the regions surrounding the phosphoglycerylation sites contain a relatively high abundance of positively charged amino acids, especially in the upstream flanking region. Additionally, according to the PTM-Logo results, non-polar and aliphatic amino acids are more abundant around the phosphoglycerylated lysine, which may play a functional role in discriminating between phosphoglycerylation and non-phosphoglycerylation sites. Many types of features were adopted to build the prediction model on the training dataset, including amino acid composition, amino acid pair composition, positional weighted matrix and position-specific scoring matrix.
      Further, to improve predictive power, the top features ranked by F-score were combined into the final feature set for classification, and predictive models were trained using decision tree (DT), random forest (RF) and support vector machine (SVM) classifiers. Evaluation by five-fold cross-validation showed that the selected features were most effective in discriminating between phosphoglycerylated and non-phosphoglycerylated sites. CONCLUSION: The SVM model trained with the selected sequence-based features performed well, with a sensitivity of 77.5%, a specificity of 73.6%, an accuracy of 74.9%, and a Matthews Correlation Coefficient value of 0.49. Furthermore, the model performed consistently on the independent testing set, yielding a sensitivity of 75.7% and a specificity of 64.9%. Finally, the model has been implemented as a web-based system, iDPGK, which is now freely available at http://mer.hc.mmh.org.tw/iDPGK/ .
FAU - Huang, Kai-Yao
AU  - Huang KY
AD  - Department of Medical Research, Hsinchu Mackay Memorial Hospital, Hsinchu City 300, Taiwan.
AD  - Department of Medicine, Mackay Medical College, New Taipei City 252, Taiwan.
FAU - Hung, Fang-Yu
AU  - Hung FY
AD  - Department of Obstetrics and Gynecology, Hsinchu Mackay Memorial Hospital, Hsinchu City 300, Taiwan.
FAU - Kao, Hui-Ju
AU  - Kao HJ
AD  - Department of Medical Research, Hsinchu Mackay Memorial Hospital, Hsinchu City 300, Taiwan.
FAU - Lau, Hui-Hsuan
AU  - Lau HH
AD  - Department of Medicine, Mackay Medical College, New Taipei City 252, Taiwan. huihsuan1220@gmail.com.
AD  - Department of Obstetrics and Gynecology, Hsinchu Mackay Memorial Hospital, Hsinchu City 300, Taiwan. huihsuan1220@gmail.com.
AD  - Department of Obstetrics and Gynecology, Mackay Memorial Hospital, Taipei City 104, Taiwan. huihsuan1220@gmail.com.
FAU - Weng, Shun-Long
AU  - Weng SL
AD  - Department of Medicine, Mackay Medical College, New Taipei City 252, Taiwan. 4467@mmh.org.tw.
AD  - Department of Obstetrics and Gynecology, Hsinchu Mackay Memorial Hospital, Hsinchu City 300, Taiwan. 4467@mmh.org.tw.
AD  - Mackay Junior College of Medicine, Medicine, Nursing and Management College, Taipei City 112, Taiwan. 4467@mmh.org.tw.
LA  - eng
GR  - MOST109-2320-B-195-001/Ministry of Science and Technology, Taiwan/
PT  - Journal Article
DEP - 20201209
PL  - England
TA  - BMC Bioinformatics
JT  - BMC bioinformatics
JID - 100965194
RN  - 0 (Proteins)
RN  - K3Z4F929H6 (Lysine)
SB  - IM
MH  - Amino Acid Sequence
MH  - Computational Biology/*methods
MH  - Glycosylation
MH  - Internet
MH  - Lysine/chemistry/*metabolism
MH  - Machine Learning
MH  - Position-Specific Scoring Matrices
MH  - Protein Processing, Post-Translational
MH  - Proteins/chemistry
MH  - ROC Curve
MH  - Reproducibility of Results
MH  - *Software
MH  - Support Vector Machine
PMC - PMC7727188
OTO - NOTNLM
OT  - 3-Phosphoglyceryl-lysine (pgK)
OT  - Post-translational modification (PTM)
OT  - Protein phosphoglycerylation
OT  - Sequence-based features
COIS- The authors have declared that no competing interests exist.
EDAT- 2020/12/11 06:00
MHDA- 2020/12/29 06:00
PMCR- 2020/12/09
CRDT- 2020/12/10 05:35
PHST- 2020/08/10 00:00 [received]
PHST- 2020/11/30 00:00 [accepted]
PHST- 2020/12/10 05:35 [entrez]
PHST- 2020/12/11 06:00 [pubmed]
PHST- 2020/12/29 06:00 [medline]
PHST- 2020/12/09 00:00 [pmc-release]
AID - 10.1186/s12859-020-03916-5 [pii]
AID - 3916 [pii]
AID - 10.1186/s12859-020-03916-5 [doi]
PST - epublish
SO  - BMC Bioinformatics. 2020 Dec 9;21(1):568. doi: 10.1186/s12859-020-03916-5.
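
The performance measures reported in the abstract (sensitivity, specificity, accuracy and Matthews Correlation Coefficient) can all be derived from a binary confusion matrix. The following is a minimal Python sketch of those standard definitions, not the authors' code. The function name `binary_metrics` and the confusion-matrix counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute standard binary-classification metrics from confusion counts.

    tp/fn: positive (e.g. phosphoglycerylated) sites predicted correctly/incorrectly.
    tn/fp: negative sites predicted correctly/incorrectly.
    Assumes at least one positive and one negative example.
    """
    sensitivity = tp / (tp + fn)            # true positive rate
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts for illustration only (not data from the paper).
metrics = binary_metrics(tp=8, fn=2, tn=15, fp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because accuracy is a class-weighted average of sensitivity and specificity, it is pulled toward the majority class when the classes are imbalanced, whereas MCC uses all four cells of the confusion matrix. That is one common reason to report MCC alongside accuracy.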