PMID- 32233158
OWN - NLM
STAT- MEDLINE
DCOM- 20200402
LR  - 20200407
IS  - 1598-6357 (Electronic)
IS  - 1011-8934 (Print)
IS  - 1011-8934 (Linking)
VI  - 35
IP  - 12
DP  - 2020 Mar 30
TI  - Extracting Structured Genotype Information from Free-Text HLA Reports Using a Rule-Based Approach.
PG  - e78
LID - 10.3346/jkms.2020.35.e78 [doi]
LID - e78
AB  - BACKGROUND: Human leukocyte antigen (HLA) typing is important for transplant patients to prevent a severe mismatch reaction, and the result can also support the diagnosis of various diseases or the prediction of drug side effects. However, such secondary applications of HLA typing results are limited because the results are typically provided in free-text format or as PDFs in electronic medical records. We here propose a method to convert HLA genotype information stored in an unstructured format into a reusable structured format by extracting serotype/allele information. METHODS: We queried HLA typing reports from the clinical data warehouse of Seoul National University Hospital (SUPPREME) from 2000 to 2018 as a rule-development data set (64,024 reports) and from the most recent year (6,181 reports) as a test set. We applied a rule-based natural language approach using a Python regex function to extract 1) the number of patients in the report, 2) clinical characteristics such as the indication for HLA testing, and 3) precise HLA genotypes. The performance of the rules and code was evaluated by comparing the results extracted from the test set with a validation set generated by manual curation. RESULTS: Among 11,287 reports in the development set and 1,107 in the test set describing HLA typing for a single patient, iterative rule generation produced 124 extracting rules and 8 cleaning rules for HLA genotypes. Application of these rules extracted HLA genotypes with 0.892-0.999 precision and 0.795-0.998 recall for the five HLA genes.
      The precision and recall of the extracting rules for the number of patients in a report were 0.997 and 0.994, and those for the clinical variable extraction were 0.997 and 0.992, respectively. All extracted HLA alleles and serotypes were transformed according to formal HLA nomenclature by the cleaning rules. CONCLUSION: The rule-based HLA genotype extraction method shows reliable accuracy. We believe that a significant number of patients would benefit if this under-used genetic information were returned to them.
CI  - (c) 2020 The Korean Academy of Medical Sciences.
FAU - Lee, Kye Hwa
AU  - Lee KH
AUID- ORCID: 0000-0002-7593-7020
AD  - Center for Precision Medicine, Seoul National University Hospital, Seoul, Korea. geffa@snu.ac.kr.
FAU - Kim, Hyo Jung
AU  - Kim HJ
AUID- ORCID: 0000-0001-9555-0926
AD  - Division of Biomedical Informatics, Seoul National University Biomedical Informatics and Systems Biomedical Informatics Research Center, Seoul National University College of Medicine, Seoul, Korea.
FAU - Kim, Yi Jun
AU  - Kim YJ
AUID- ORCID: 0000-0002-1763-4267
AD  - Center for Precision Medicine, Seoul National University Hospital, Seoul, Korea.
FAU - Kim, Ju Han
AU  - Kim JH
AUID- ORCID: 0000-0003-1522-9038
AD  - Division of Biomedical Informatics, Seoul National University Biomedical Informatics and Systems Biomedical Informatics Research Center, Seoul National University College of Medicine, Seoul, Korea.
FAU - Song, Eun Young
AU  - Song EY
AUID- ORCID: 0000-0003-1286-9611
AD  - Department of Laboratory Medicine, Seoul National University College of Medicine, Seoul, Korea. eysong1@snu.ac.kr.
LA  - eng
GR  - NRF-2018R1D1A1A02086109/NRF/National Research Foundation of Korea/Korea
PT  - Journal Article
DEP - 20200330
PL  - Korea (South)
TA  - J Korean Med Sci
JT  - Journal of Korean medical science
JID - 8703518
RN  - 0 (HLA Antigens)
SB  - IM
MH  - Algorithms
MH  - Data Warehousing
MH  - Electronic Health Records
MH  - Genotype
MH  - HLA Antigens/*genetics
MH  - *Histocompatibility Testing
MH  - Humans
MH  - *Information Storage and Retrieval
MH  - *Natural Language Processing
MH  - Seoul
PMC - PMC7105511
OTO - NOTNLM
OT  - Data Sets as Topic
OT  - Electronic Medical Record
OT  - Genetic Testing
OT  - HLA Test
OT  - Major Histocompatibility Complex
COIS- The authors have no potential conflicts of interest to disclose.
EDAT- 2020/04/02 06:00
MHDA- 2020/04/03 06:00
PMCR- 2020/03/30
CRDT- 2020/04/02 06:00
PHST- 2019/05/16 00:00 [received]
PHST- 2020/01/29 00:00 [accepted]
PHST- 2020/04/02 06:00 [entrez]
PHST- 2020/04/02 06:00 [pubmed]
PHST- 2020/04/03 06:00 [medline]
PHST- 2020/03/30 00:00 [pmc-release]
AID - 35.e78 [pii]
AID - 10.3346/jkms.2020.35.e78 [doi]
PST - epublish
SO  - J Korean Med Sci. 2020 Mar 30;35(12):e78. doi: 10.3346/jkms.2020.35.e78.
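
The abstract describes extracting HLA genotypes from free text with Python regular expressions, normalizing them with cleaning rules, and scoring the output against manual curation by precision and recall. The following is a minimal sketch of that general workflow, not the authors' code: the record does not list any of the 124 extracting or 8 cleaning rules, so the single pattern, the gene list (A, B, C, DRB1, DQB1), the sample report text, and the function names `extract_alleles`, `clean_allele`, and `precision_recall` are all illustrative assumptions.

```python
import re

# Hypothetical extracting rule (an assumption, not one of the study's rules):
# an optional "HLA-" prefix, a gene name, "*", then digit fields that may be
# colon-separated ("02:01") or written in legacy form without colons ("0201").
ALLELE_RE = re.compile(
    r"(?:HLA-)?\b(DRB1|DQB1|A|B|C)\s*\*\s*(\d+(?::\d+)*)",
    re.IGNORECASE,
)


def clean_allele(gene, digits):
    """Hypothetical cleaning rule: normalize to GENE*field1:field2."""
    if ":" in digits:
        fields = digits.split(":")
    elif len(digits) == 4:
        # Legacy four-digit notation, e.g. "0201" -> "02", "01".
        fields = [digits[:2], digits[2:]]
    else:
        # Anything else (e.g. a single two-digit field) is kept unchanged.
        fields = [digits]
    # Keep at most the first two fields for this sketch.
    return f"{gene.upper()}*{':'.join(fields[:2])}"


def extract_alleles(text):
    """Return cleaned allele names in the order they appear in the text."""
    return [clean_allele(m.group(1), m.group(2)) for m in ALLELE_RE.finditer(text)]


def precision_recall(extracted, curated):
    """Compare extracted alleles with a manually curated set for one report."""
    extracted, curated = set(extracted), set(curated)
    true_positives = len(extracted & curated)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(curated) if curated else 0.0
    return precision, recall


# Made-up report text and curation result, for illustration only.
report = (
    "HLA typing result: HLA-A*02:01, A*2402; B*15:01, B*5101; "
    "DRB1*04:05, DRB1*15:01."
)
curated = [
    "A*02:01", "A*24:02", "B*15:01", "B*51:01",
    "DRB1*04:05", "DRB1*15:01", "C*03:04",
]

found = extract_alleles(report)
print(found)
print(precision_recall(found, curated))
```

In this made-up example the curated list contains one allele that the report text never mentions, so recall falls below 1 while precision stays at 1. A real rule set would also need to handle serotype notation, multi-patient reports, and the clinical variables mentioned in the abstract, none of which this sketch attempts.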