PMID- 33329723
OWN - NLM
STAT- PubMed-not-MEDLINE
LR  - 20240330
IS  - 1664-8021 (Print)
IS  - 1664-8021 (Electronic)
IS  - 1664-8021 (Linking)
VI  - 11
DP  - 2020
TI  - A Novel XGBoost Method to Identify Cancer Tissue-of-Origin Based on Copy Number Variations.
PG  - 585029
LID - 10.3389/fgene.2020.585029 [doi]
LID - 585029
AB  - Identifying the tissue of origin of cancer of unknown primary (CUP) is of great significance for designing more effective treatments and improving diagnostic efficiency in cancer patients. In this study, we develop a machine learning model that traces the tissue of origin of CUP with high accuracy, following feature engineering and model evaluation. Using copy number variation data comprising 4,566 training cases and 1,262 independent validation cases, we apply an XGBoost classifier to 10 types of cancer. An extremely randomized trees (Extra Trees) model is used for dimension reduction, so that fewer variables replace the original high-dimensional variables. The 300 features with the highest weights are selected, and principal component analysis is applied to eliminate noise. The XGBoost classifier achieves the highest overall accuracy for predicting tumor tissue of origin: 0.8913 in 10-fold cross-validation on the training samples and 0.7421 on the independent validation dataset. Furthermore, a comparison of performance indices such as precision and recall shows that the XGBoost classifier significantly improves classification performance across tumor types, with less prediction error than other classifiers such as K-nearest neighbors (KNN), Bayes, support vector machine (SVM), and AdaBoost. Our method can infer the tissue of origin for the 10 cancer types with acceptable accuracy on both cross-validation and independent validation data. It may be used as an auxiliary diagnostic method to determine the actual clinicopathological status of a specific cancer.
CI  - Copyright (c) 2020 Zhang, Feng, Wang, Dong, Yang, Su and Wang.
FAU - Zhang, Yulin
AU  - Zhang Y
AD  - College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, China.
FAU - Feng, Tong
AU  - Feng T
AD  - College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao, China.
FAU - Wang, Shudong
AU  - Wang S
AD  - College of Computer and Communication Engineering, China University of Petroleum (East China), Qingdao, China.
FAU - Dong, Ruyi
AU  - Dong R
AD  - Geneis (Beijing) Co., Ltd., Beijing, China.
FAU - Yang, Jialiang
AU  - Yang J
AD  - Geneis (Beijing) Co., Ltd., Beijing, China.
FAU - Su, Jionglong
AU  - Su J
AD  - School of AI and Advanced Computing, XJTLU Entrepreneur College (Taicang), Xi'an Jiaotong-Liverpool University, Suzhou, China.
FAU - Wang, Bo
AU  - Wang B
AD  - Geneis (Beijing) Co., Ltd., Beijing, China.
LA  - eng
PT  - Journal Article
DEP - 20201120
PL  - Switzerland
TA  - Front Genet
JT  - Frontiers in genetics
JID - 101560621
PMC - PMC7716814
OTO - NOTNLM
OT  - XGBoost
OT  - copy number variations
OT  - extremely randomized tree
OT  - multiclass
OT  - principal component analysis
OT  - tissue-of-origin
EDAT- 2020/12/18 06:00
MHDA- 2020/12/18 06:01
PMCR- 2020/11/20
CRDT- 2020/12/17 05:51
PHST- 2020/07/19 00:00 [received]
PHST- 2020/10/05 00:00 [accepted]
PHST- 2020/12/17 05:51 [entrez]
PHST- 2020/12/18 06:00 [pubmed]
PHST- 2020/12/18 06:01 [medline]
PHST- 2020/11/20 00:00 [pmc-release]
AID - 10.3389/fgene.2020.585029 [doi]
PST - epublish
SO  - Front Genet. 2020 Nov 20;11:585029. doi: 10.3389/fgene.2020.585029. eCollection 2020.
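
The abstract describes a four-stage workflow: rank features with an Extra Trees model, keep the 300 highest-weighted features, apply principal component analysis, and classify with gradient-boosted trees scored by 10-fold cross-validation. The following is a minimal sketch of such a workflow, not the authors' implementation. Everything beyond those four stages is an assumption: the function names `build_pipeline` and `cross_validated_accuracy` are hypothetical, the number of principal components and all tree counts are illustrative (the record does not state them), and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so the example depends on scikit-learn only. `xgboost.XGBClassifier` follows the same fit/predict interface and could be substituted in the final step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def build_pipeline(top_k=300, n_components=50, seed=0):
    """Extra Trees feature ranking -> top-k selection -> PCA -> boosted trees.

    top_k=300 follows the abstract; n_components and the tree counts are
    assumptions chosen only for illustration.
    """
    # threshold=-inf disables the importance cutoff, so exactly the
    # top_k features with the largest importances are kept.
    selector = SelectFromModel(
        ExtraTreesClassifier(n_estimators=100, random_state=seed),
        max_features=top_k,
        threshold=-np.inf,
    )
    return Pipeline([
        ("select", selector),
        ("pca", PCA(n_components=n_components, random_state=seed)),
        # Stand-in for XGBoost (assumption): gradient-boosted trees
        # from scikit-learn.
        ("boost", GradientBoostingClassifier(n_estimators=50, random_state=seed)),
    ])


def cross_validated_accuracy(X, y, n_splits=10, **pipeline_kwargs):
    """Mean accuracy over stratified k-fold cross-validation."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = cross_val_score(
        build_pipeline(**pipeline_kwargs), X, y, cv=cv, scoring="accuracy"
    )
    return float(scores.mean())


if __name__ == "__main__":
    # Synthetic stand-in data; the study's copy number variation
    # matrix is not part of this record.
    X, y = make_classification(
        n_samples=300, n_features=400, n_informative=30,
        n_classes=5, random_state=0,
    )
    print(cross_validated_accuracy(X, y, n_splits=10, top_k=300, n_components=50))
```

Because selection and PCA sit inside the pipeline, they are refit on each training fold, which keeps held-out samples from influencing feature ranking. Whether the study performed selection inside or outside the cross-validation loop is not stated in this record. Note that `top_k` must not exceed the number of input features, and `n_components` must not exceed `top_k` or the number of training samples in a fold.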