PMID- 30499597
OWN - NLM
STAT- MEDLINE
DCOM- 20190228
LR  - 20190228
IS  - 2473-4209 (Electronic)
IS  - 0094-2405 (Linking)
VI  - 46
IP  - 2
DP  - 2019 Feb
TI  - Automated data extraction and ensemble methods for predictive modeling of
      breast cancer outcomes after radiation therapy.
PG  - 1054-1063
LID - 10.1002/mp.13314 [doi]
AB  - PURPOSE: The purpose of this study was to compare the effectiveness of
      ensemble methods (e.g., random forests) and single-model methods (e.g.,
      logistic regression and decision trees) in predictive modeling of post-RT
      treatment failure and adverse events (AEs) for breast cancer patients using
      automatically extracted EMR data. METHODS: Data from 1967 consecutive breast
      radiotherapy (RT) courses at one institution between 2008 and 2015 were
      automatically extracted from EMRs and oncology information systems using
      extraction software. Over 230 variables were extracted spanning the following
      variable segments: patient demographics, medical/surgical history, tumor
      characteristics, RT treatment history, and AEs tracked using CTCAEv4.0.
      Treatment failure was extracted algorithmically by searching posttreatment
      encounters for evidence of local, nodal, or distant failure. Individual
      models were trained using decision trees, logistic regression, random
      forests, and boosted decision trees to predict treatment failures and AEs.
      Models were fit on 75% of the data and evaluated for probability calibration
      and area under the ROC curve (AUC) on the remaining test set. The impact of
      each variable segment was assessed by retraining without the segment and
      measuring change in AUC (DeltaAUC). RESULTS: All AUC values were
      statistically significant (P < 0.05). Ensemble methods outperformed
      single-model methods across all outcomes. The best ensemble method
      outperformed decision trees and logistic regression by an average AUC of
      0.053 and 0.034, respectively. Model probabilities were well calibrated as
      evidenced by calibration curves. Excluding the patient medical history
      variable segment led to the largest AUC reduction in all models (Average
      DeltaAUC = -0.025), followed by RT treatment history (-0.021) and tumor
      information (-0.015). CONCLUSION: In this largest such study in breast cancer
      performed to date, automatically extracted EMR data provided a basis for
      reliable outcome predictions across multiple statistical methods. Ensemble
      methods provided substantial advantages over single-model methods. Patient
      medical history contributed the most to prediction quality.
CI  - (c) 2018 American Association of Physicists in Medicine.
FAU - Lindsay, William D
AU  - Lindsay WD
AD  - Oncora Medical, Philadelphia, PA, 19103, USA.
FAU - Ahern, Christopher A
AU  - Ahern CA
AD  - Oncora Medical, Philadelphia, PA, 19103, USA.
FAU - Tobias, Jacob S
AU  - Tobias JS
AD  - Oncora Medical, Philadelphia, PA, 19103, USA.
FAU - Berlind, Christopher G
AU  - Berlind CG
AD  - Oncora Medical, Philadelphia, PA, 19103, USA.
FAU - Chinniah, Chidambaram
AU  - Chinniah C
AD  - Department of Radiation Oncology, Hospital of the University of Pennsylvania,
      Philadelphia, PA, 19104, USA.
FAU - Gabriel, Peter E
AU  - Gabriel PE
AD  - Department of Radiation Oncology, Hospital of the University of Pennsylvania,
      Philadelphia, PA, 19104, USA.
FAU - Gee, James C
AU  - Gee JC
AD  - Department of Radiology, Perelman School of Medicine, University of
      Pennsylvania, Philadelphia, PA 19104, USA.
FAU - Simone, Charles B 2nd
AU  - Simone CB 2nd
AD  - Department of Radiation Oncology, University of Maryland School of Medicine,
      Baltimore, MD, 21201, USA.
LA  - eng
PT  - Journal Article
DEP - 20181228
PL  - United States
TA  - Med Phys
JT  - Medical physics
JID - 0425746
SB  - IM
MH  - Breast Neoplasms/*pathology/*radiotherapy
MH  - Data Mining/*methods
MH  - *Decision Trees
MH  - *Electronic Health Records
MH  - Female
MH  - Humans
MH  - *Machine Learning
MH  - Middle Aged
MH  - Predictive Value of Tests
MH  - Radiotherapy Dosage
MH  - Treatment Outcome
OTO - NOTNLM
OT  - automated data extraction
OT  - ensemble methods
OT  - machine learning
OT  - predictive modeling
OT  - radiotherapy outcomes
EDAT- 2018/12/01 06:00
MHDA- 2019/03/01 06:00
CRDT- 2018/12/01 06:00
PHST- 2018/02/21 00:00 [received]
PHST- 2018/11/11 00:00 [revised]
PHST- 2018/11/12 00:00 [accepted]
PHST- 2018/12/01 06:00 [pubmed]
PHST- 2019/03/01 06:00 [medline]
PHST- 2018/12/01 06:00 [entrez]
AID - 10.1002/mp.13314 [doi]
PST - ppublish
SO  - Med Phys. 2019 Feb;46(2):1054-1063. doi: 10.1002/mp.13314. Epub 2018 Dec 28.
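
Illustrative note (not part of the indexed record): the METHODS portion of the abstract describes measuring the impact of a variable segment as the change in test-set AUC after retraining without that segment (DeltaAUC). The sketch below shows one plausible way to compute that quantity from held-out labels and model scores. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `auc` and `delta_auc`, the pairwise (Mann-Whitney) AUC formulation, the sign convention (ablated minus full), and the toy scores are all assumptions introduced here for illustration.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via the pairwise (Mann-Whitney) formulation.

    Returns the fraction of (positive, negative) pairs in which the
    positive case receives the higher score; ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def delta_auc(
    labels: Sequence[int],
    full_scores: Sequence[float],
    ablated_scores: Sequence[float],
) -> float:
    """Change in AUC when a variable segment is removed (assumed convention:
    ablated minus full, so a negative value means the segment was helping)."""
    return auc(labels, ablated_scores) - auc(labels, full_scores)


if __name__ == "__main__":
    # Hypothetical held-out labels and scores; NOT data from the study.
    y_test = [1, 1, 0, 0, 1, 0]
    full = [0.9, 0.8, 0.3, 0.2, 0.7, 0.4]      # model trained on all segments
    ablated = [0.9, 0.35, 0.3, 0.2, 0.7, 0.4]  # model retrained without one segment
    print(f"DeltaAUC = {delta_auc(y_test, full, ablated):.3f}")
```

In practice the two score vectors would come from two separately trained models evaluated on the same held-out 25% of the data, one fit on all variables and one fit with the segment's columns dropped.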