PMID- 37398941
OWN - NLM
STAT- PubMed-not-MEDLINE
LR  - 20230705
IS  - 2475-9066 (Print)
IS  - 2475-9066 (Electronic)
IS  - 2475-9066 (Linking)
VI  - 8
IP  - 82
DP  - 2023
TI  - CVtreeMLE: Efficient Estimation of Mixed Exposures using Data Adaptive Decision Trees and Cross-Validated Targeted Maximum Likelihood Estimation in R.
LID - 4181 [pii]
LID - 10.21105/joss.04181 [doi]
AB  - Statistical causal inference of mixed exposures has been limited by reliance on parametric models and, until recently, by researchers considering only one exposure at a time, usually estimated as a beta coefficient in a generalized linear regression model (GLM). This independent assessment of exposures poorly estimates the joint impact of a collection of the same exposures in a realistic exposure setting. Marginal methods for mixture variable selection, such as ridge/lasso regression, are biased by linear assumptions, and the interactions modeled are chosen by the user. Clustering methods such as principal component regression lose both interpretability and valid inference. Newer mixture methods such as quantile g-computation (Keil et al., 2020) are biased by linear/additive assumptions. More flexible methods such as Bayesian kernel machine regression (BKMR) (Bobb et al., 2014) are sensitive to the choice of tuning parameters, are computationally taxing, and lack an interpretable and robust summary statistic of dose-response relationships. No method currently exists that finds the best flexible model to adjust for covariates while applying a non-parametric model that targets interactions in a mixture and delivers valid inference for a target parameter. Non-parametric methods such as decision trees are a useful tool for evaluating combined exposures: they find partitions in the joint-exposure (mixture) space that best explain the variance in an outcome.
      However, current methods that use decision trees to assess statistical inference for interactions are biased and prone to overfitting because they use the full data both to identify nodes in the tree and to make statistical inference given these nodes. Other methods derive inference from an independent test set, which does not use the full data. The CVtreeMLE R package provides researchers in (bio)statistics, epidemiology, and environmental health sciences with access to state-of-the-art statistical methodology for evaluating the causal effects of a data-adaptively determined mixed exposure using decision trees. Our target audience is analysts who would normally use a potentially biased GLM-based model for a mixed exposure. Instead, we hope to provide users with a non-parametric statistical machine in which they simply specify the exposures, covariates, and outcome; CVtreeMLE then determines whether a best-fitting decision tree exists and delivers interpretable results.
FAU - McCoy, David
AU  - McCoy D
AUID- ORCID: 0000-0002-5515-6307
AD  - Division of Environmental Health Sciences, University of California, Berkeley, CA, United States of America.
FAU - Hubbard, Alan
AU  - Hubbard A
AUID- ORCID: 0000-0002-3769-0127
AD  - Department of Biostatistics, University of California, Berkeley, CA, United States of America.
FAU - Van der Laan, Mark
AU  - Van der Laan M
AUID- ORCID: 0000-0003-1432-5511
AD  - Department of Biostatistics, University of California, Berkeley, CA, United States of America.
LA  - eng
GR  - P42 ES004705/ES/NIEHS NIH HHS/United States
PT  - Journal Article
DEP - 20230221
PL  - United States
TA  - J Open Source Softw
JT  - Journal of open source software
JID - 101708638
PMC - PMC10312067
MID - NIHMS1889635
EDAT- 2023/07/03 13:06
MHDA- 2023/07/03 13:07
PMCR- 2023/06/30
CRDT- 2023/07/03 11:53
PHST- 2023/07/03 13:07 [medline]
PHST- 2023/07/03 13:06 [pubmed]
PHST- 2023/07/03 11:53 [entrez]
PHST- 2023/06/30 00:00 [pmc-release]
AID - 4181 [pii]
AID - 10.21105/joss.04181 [doi]
PST - ppublish
SO  - J Open Source Softw. 2023;8(82):4181. doi: 10.21105/joss.04181. Epub 2023 Feb 21.
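
The abstract's point about overfitting (the same data should not both choose the tree nodes and supply the inference) can be illustrated with a minimal sketch. It is not CVtreeMLE's API or estimator, and it is written in Python rather than R purely for illustration. It assumes a single exposure and a single split, with no covariate adjustment and no targeted maximum likelihood step. The function names (`simulate`, `best_split`, `cross_validated_effect`) and the simulated data are hypothetical. In each fold the split is chosen on the training portion only, and the difference in mean outcome across that split is then estimated on the held-out portion.

```python
import random
from statistics import mean


def simulate(n=200, seed=1):
    """Hypothetical data: the outcome jumps by 2 when the exposure exceeds 0.5."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    ys = [(2.0 if x > 0.5 else 0.0) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


def best_split(xs, ys):
    """Threshold on x that minimises within-group squared error of y (a one-node 'tree')."""
    best_t, best_sse = None, float("inf")
    # Dropping the largest value keeps both sides of every candidate split non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        m_left, m_right = mean(left), mean(right)
        sse = sum((y - m_left) ** 2 for y in left) + sum((y - m_right) ** 2 for y in right)
        if sse < best_sse:
            best_t, best_sse = t, sse
    return best_t


def cross_validated_effect(xs, ys, n_folds=5, seed=0):
    """Per fold: choose the split on training data, estimate its effect on held-out data."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    results = []
    for val_idx in folds:
        held_out = set(val_idx)
        train_idx = [i for i in idx if i not in held_out]
        # The split is found without looking at the held-out observations.
        t = best_split([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        if t is None:  # all training exposures identical, so no split exists
            continue
        high = [ys[i] for i in val_idx if xs[i] > t]
        low = [ys[i] for i in val_idx if xs[i] <= t]
        if high and low:
            # Unadjusted mean difference, estimated on data the split never saw.
            results.append((t, mean(high) - mean(low)))
    return results


if __name__ == "__main__":
    xs, ys = simulate()
    for threshold, diff in cross_validated_effect(xs, ys):
        print(f"split at x > {threshold:.3f}: held-out mean difference {diff:.2f}")
```

Because every observation is held out in exactly one fold, the full sample contributes to estimation, unlike a single train/test split. Pooling the fold-specific estimates and computing valid standard errors is beyond this sketch.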