PMID- 31150060 OWN - NLM STAT- MEDLINE DCOM- 20200706 LR - 20220716 IS - 1367-4811 (Electronic) IS - 1367-4803 (Print) IS - 1367-4803 (Linking) VI - 35 IP - 23 DP - 2019 Dec 1 TI - Augmented Interval List: a novel data structure for efficient genomic interval search. PG - 4907-4911 LID - 10.1093/bioinformatics/btz407 [doi] AB - MOTIVATION: Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary. RESULTS: We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log2N+n+m), where n is the number of overlaps between R and q, N is the number of intervals in the set R and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5-18 times faster than standard high-performance code based on augmented interval-trees, nested containment lists or R-trees (BEDTools). For large datasets, the memory-usage for AIList is 4-60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis. AVAILABILITY AND IMPLEMENTATION: An implementation of the AIList data structure with both construction and search algorithms is available at http://ailist.databio.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. CI - (c) The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com. FAU - Feng, Jianglin AU - Feng J AD - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA. FAU - Ratan, Aakrosh AU - Ratan A AD - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA. AD - Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA. AD - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA. FAU - Sheffield, Nathan C AU - Sheffield NC AD - Center for Public Health Genomics, University of Virginia, Charlottesville, VA, USA. AD - Department of Public Health Sciences, University of Virginia, Charlottesville, VA, USA. AD - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA. AD - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA. LA - eng GR - R35 GM128636/GM/NIGMS NIH HHS/United States PT - Journal Article PT - Research Support, N.I.H., Extramural PT - Research Support, Non-U.S. Gov't PL - England TA - Bioinformatics JT - Bioinformatics (Oxford, England) JID - 9808944 SB - IM MH - Algorithms MH - Genome MH - *Genomics MH - *Software PMC - PMC6901075 EDAT- 2019/06/01 06:00 MHDA- 2020/07/07 06:00 PMCR- 2020/12/01 CRDT- 2019/06/01 06:00 PHST- 2019/01/18 00:00 [received] PHST- 2019/04/08 00:00 [revised] PHST- 2019/05/28 00:00 [accepted] PHST- 2019/06/01 06:00 [pubmed] PHST- 2020/07/07 06:00 [medline] PHST- 2019/06/01 06:00 [entrez] PHST- 2020/12/01 00:00 [pmc-release] AID - 5509521 [pii] AID - btz407 [pii] AID - 10.1093/bioinformatics/btz407 [doi] PST - ppublish SO - Bioinformatics. 2019 Dec 1;35(23):4907-4911. doi: 10.1093/bioinformatics/btz407.