PMID- 35977827
OWN - NLM
STAT- MEDLINE
DCOM- 20220819
LR  - 20240409
IS  - 2575-1077 (Electronic)
IS  - 2575-1077 (Linking)
VI  - 5
IP  - 12
DP  - 2022 Aug 17
TI  - Telescoping bimodal latent Dirichlet allocation to identify expression QTLs across tissues.
LID - 10.26508/lsa.202101297 [doi]
LID - e202101297
AB  - Expression quantitative trait loci (eQTLs), or single-nucleotide polymorphisms that affect average gene expression levels, provide important insights into context-specific gene regulation. Classic eQTL analyses use one-to-one association tests, which test gene-variant pairs individually and ignore correlations induced by gene regulatory networks and linkage disequilibrium. Probabilistic topic models, such as latent Dirichlet allocation, estimate latent topics for a collection of count observations. Prior multimodal frameworks that bridge genotype and expression data assume matched sample numbers between modalities. However, many data sets have a nested structure where one individual has several associated gene expression samples and a single germline genotype vector. Here, we build a telescoping bimodal latent Dirichlet allocation (TBLDA) framework to learn shared topics across gene expression and genotype data that allows multiple RNA sequencing samples to correspond to a single individual's genotype. By using raw count data, our model avoids possible adulteration via normalization procedures. Ancestral structure is captured in a genotype-specific latent space, effectively removing it from shared components. Using GTEx v8 expression data across 10 tissues and genotype data, we show that the estimated topics capture meaningful and robust biological signal in both modalities and identify associations within and across tissue types. We identify 4,645 cis-eQTLs and 995 trans-eQTLs by conducting eQTL mapping between the most informative features in each topic. Our TBLDA model is able to identify associations using raw sequencing count data when the samples in two separate data modalities are matched one-to-many, as is often the case in biological data. Our code is freely available at https://github.com/gewirtz/TBLDA.
CI  - (c) 2022 Gewirtz et al.
FAU - Gewirtz, Ariel Dh
AU  - Gewirtz AD
AUID- ORCID: 0000-0001-9801-1354
AD  - Lewis-Sigler Institute of Integrative Genomics, Princeton University, Princeton, NJ, USA.
FAU - Townes, F William
AU  - Townes FW
AD  - Department of Computer Science, Princeton University, Princeton, NJ, USA.
FAU - Engelhardt, Barbara E
AU  - Engelhardt BE
AUID- ORCID: 0000-0002-6139-7334
AD  - Department of Computer Science, Princeton University, Princeton, NJ, USA.
AD  - Gladstone Institutes, San Francisco, CA, USA.
LA  - eng
GR  - R01 HL133218/HL/NHLBI NIH HHS/United States
GR  - U2C CA233195/CA/NCI NIH HHS/United States
PT  - Journal Article
PT  - Research Support, N.I.H., Extramural
PT  - Research Support, Non-U.S. Gov't
PT  - Research Support, U.S. Gov't, Non-P.H.S.
DEP - 20220817
PL  - United States
TA  - Life Sci Alliance
JT  - Life science alliance
JID - 101728869
SB  - IM
MH  - Gene Expression Regulation
MH  - Gene Regulatory Networks
MH  - Genotype
MH  - *Polymorphism, Single Nucleotide/genetics
MH  - *Quantitative Trait Loci/genetics
PMC - PMC9387650
COIS- BE Engelhardt is on the SAB of Creyon Bio, ArrePath, and Freenome.
EDAT- 2022/08/18 06:00
MHDA- 2022/08/20 06:00
PMCR- 2022/08/17
CRDT- 2022/08/17 21:41
PHST- 2021/11/11 00:00 [received]
PHST- 2022/07/15 00:00 [revised]
PHST- 2022/07/18 00:00 [accepted]
PHST- 2022/08/17 21:41 [entrez]
PHST- 2022/08/18 06:00 [pubmed]
PHST- 2022/08/20 06:00 [medline]
PHST- 2022/08/17 00:00 [pmc-release]
AID - 5/12/e202101297 [pii]
AID - LSA-2021-01297 [pii]
AID - 10.26508/lsa.202101297 [doi]
PST - epublish
SO  - Life Sci Alliance. 2022 Aug 17;5(12):e202101297. doi: 10.26508/lsa.202101297.