PMID- 29474353
OWN - NLM
STAT- MEDLINE
DCOM- 20180718
LR  - 20190307
IS  - 1553-7404 (Electronic)
IS  - 1553-7390 (Print)
IS  - 1553-7390 (Linking)
VI  - 14
IP  - 2
DP  - 2018 Feb
TI  - Deep sequencing of HBV pre-S region reveals high heterogeneity of HBV genotypes and associations of word pattern frequencies with HCC.
PG  - e1007206
LID - 10.1371/journal.pgen.1007206 [doi]
LID - e1007206
AB  - Hepatitis B virus (HBV) infection is a common problem in the world, especially in China. More than 60-80% of hepatocellular carcinoma (HCC) cases can be attributed to HBV infection in regions of high HBV prevalence. Although traditional Sanger sequencing has been extensively used to investigate HBV sequences, next-generation sequencing (NGS) is becoming more commonly used. Further, it is unknown whether word pattern frequencies of HBV reads from NGS can be used to investigate HBV genotypes and predict HCC status. In this study, we used NGS to sequence the pre-S region of the HBV sequence of 94 HCC patients and 45 chronic HBV (CHB) infected individuals. Word pattern frequencies among the sequence data of all individuals were calculated and compared using the Manhattan distance. The individuals were grouped using principal coordinate analysis (PCoA) and hierarchical clustering. Word pattern frequencies were also used to build prediction models for HCC status using both K-nearest neighbors (KNN) and support vector machine (SVM). We showed the extremely high power of analyzing HBV sequences using word patterns. Our key findings include that the first principal coordinate of the PCoA was highly associated with the fraction of genotype B (or C) sequences and the second principal coordinate was significantly associated with the probability of having HCC. Hierarchical clustering first groups the individuals according to their major genotypes followed by their HCC status. Using cross-validation, high areas under the receiver operating characteristic curve (AUC) of around 0.88 for KNN and 0.92 for SVM were obtained. In the independent data set of 46 HCC patients and 31 CHB individuals, a good AUC score of 0.77 was obtained using SVM. It was further shown that 3000 reads for each individual can yield stable prediction results for SVM. Thus, another key finding is that word patterns can be used to predict HCC status with high accuracy. Therefore, our study shows clearly that word pattern frequencies of HBV sequences contain much information about the composition of different HBV genotypes and the HCC status of an individual.
FAU - Bai, Xin
AU  - Bai X
AUID- ORCID: 0000-0002-3755-0730
AD  - Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China.
AD  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China.
AD  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, California, United States of America.
FAU - Jia, Jian-An
AU  - Jia JA
AD  - Department of Laboratory Medicine, Eastern Hepatobiliary Surgery Hospital, Second Military Medical University, Shanghai, China.
AD  - Department of Laboratory Medicine, the 105th Hospital of PLA, Hefei, China.
FAU - Fang, Meng
AU  - Fang M
AD  - Department of Laboratory Medicine, Eastern Hepatobiliary Surgery Hospital, Second Military Medical University, Shanghai, China.
FAU - Chen, Shipeng
AU  - Chen S
AD  - Department of Laboratory Medicine, Eastern Hepatobiliary Surgery Hospital, Second Military Medical University, Shanghai, China.
FAU - Liang, Xiaotao
AU  - Liang X
AD  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China.
AD  - School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.
FAU - Zhu, Shanfeng
AU  - Zhu S
AD  - School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Fudan University, Shanghai, China.
FAU - Zhang, Shuqin
AU  - Zhang S
AD  - Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China.
AD  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China.
AD  - Shanghai Key Laboratory for Contemporary Applied Mathematics, Fudan University, Shanghai, China.
FAU - Feng, Jianfeng
AU  - Feng J
AD  - Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China.
AD  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China.
AD  - Department of Computer Science, University of Warwick, Coventry, United Kingdom.
FAU - Sun, Fengzhu
AU  - Sun F
AUID- ORCID: 0000-0002-8552-043X
AD  - Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, China.
AD  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China.
AD  - Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, California, United States of America.
FAU - Gao, Chunfang
AU  - Gao C
AD  - Department of Laboratory Medicine, Eastern Hepatobiliary Surgery Hospital, Second Military Medical University, Shanghai, China.
LA  - eng
GR  - R01 GM120624/GM/NIGMS NIH HHS/United States
GR  - R01GM120624/NH/NIH HHS/United States
PT  - Journal Article
PT  - Research Support, N.I.H., Extramural
PT  - Research Support, Non-U.S. Gov't
DEP - 20180223
PL  - United States
TA  - PLoS Genet
JT  - PLoS genetics
JID - 101239074
RN  - 0 (DNA, Viral)
RN  - 0 (Hepatitis B Surface Antigens)
RN  - 0 (Protein Precursors)
RN  - 0 (presurface protein 1, hepatitis B surface antigen)
RN  - 0 (presurface protein 2, hepatitis B surface antigen)
SB  - IM
MH  - Carcinoma, Hepatocellular/epidemiology/genetics/*virology
MH  - DNA Fingerprinting
MH  - DNA, Viral/analysis
MH  - Gene Frequency
MH  - Genetic Association Studies/methods
MH  - *Genetic Heterogeneity
MH  - Genotype
MH  - Hepatitis B Surface Antigens/*genetics
MH  - Hepatitis B virus/classification/*genetics
MH  - Hepatitis B, Chronic/complications/epidemiology/genetics/*virology
MH  - High-Throughput Nucleotide Sequencing
MH  - Humans
MH  - Liver Neoplasms/epidemiology/genetics/*virology
MH  - Phylogeny
MH  - Protein Precursors/genetics
PMC - PMC5841821
COIS- The authors have declared that no competing interests exist.
EDAT- 2018/02/24 06:00
MHDA- 2018/07/19 06:00
PMCR- 2018/02/23
CRDT- 2018/02/24 06:00
PHST- 2017/06/30 00:00 [received]
PHST- 2018/01/17 00:00 [accepted]
PHST- 2018/03/07 00:00 [revised]
PHST- 2018/02/24 06:00 [pubmed]
PHST- 2018/07/19 06:00 [medline]
PHST- 2018/02/24 06:00 [entrez]
PHST- 2018/02/23 00:00 [pmc-release]
AID - PGENETICS-D-17-01296 [pii]
AID - 10.1371/journal.pgen.1007206 [doi]
PST - epublish
SO  - PLoS Genet. 2018 Feb 23;14(2):e1007206. doi: 10.1371/journal.pgen.1007206. eCollection 2018 Feb.
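
The abstract describes computing word pattern frequencies for each individual's reads and comparing individuals by Manhattan distance. The following is only a minimal sketch of that general idea, not the authors' implementation. The word length `k`, the pooling of counts across all reads of an individual, the skipping of ambiguous bases, and the toy reads and sample names are all assumptions made for illustration; the record does not specify them.

```python
from collections import Counter
from itertools import product


def word_frequencies(reads, k=2):
    """Relative frequencies of all 4**k DNA words, pooled over a set of reads.

    The value of k is a hypothetical choice; the abstract does not state one.
    """
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            # Skip windows that contain ambiguous bases such as N (assumption).
            if set(word) <= set("ACGT"):
                counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def manhattan(p, q):
    """Manhattan (L1) distance between two equal-length frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Toy reads for three hypothetical individuals (not real HBV data).
samples = {
    "ind1": ["ACGTACGT", "ACGTTTGA"],
    "ind2": ["ACGTACGT", "ACGTTTGA"],
    "ind3": ["GGGGCCCC", "CCCCGGGG"],
}
freqs = {name: word_frequencies(reads, k=2) for name, reads in samples.items()}
names = sorted(freqs)

# Pairwise distance matrix, rows and columns ordered as in `names`.
dist = [[manhattan(freqs[a], freqs[b]) for b in names] for a in names]
```

A distance matrix of this kind is the usual input to ordination methods such as PCoA and to hierarchical clustering, and the frequency vectors themselves can serve as features for classifiers such as KNN or SVM.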