Springer Nature
12864_2021_7949_MOESM1_ESM.docx (3.09 MB)

Additional file 1 of Mid-pass whole genome sequencing enables biomedical genetic studies of diverse populations

Version 2 2022-01-11, 04:41
Version 1 2021-11-01, 04:19
journal contribution
posted on 2022-01-11, 04:41 authored by Anne-Katrin Emde, Amanda Phipps-Green, Murray Cadzow, C. Scott Gallagher, Tanya J. Major, Marilyn E. Merriman, Ruth K. Topless, Riku Takei, Nicola Dalbeth, Rinki Murphy, Lisa K. Stamp, Janak de Zoysa, Philip L. Wilcox, Keolu Fox, Kaja A. Wasik, Tony R. Merriman, Stephane E. Castel
Additional file 1: Figure S1. Benchmarking of libraries generated with low-pass kits sequenced at intermediate coverage levels. a) Mean coverage across the library types. b) Per-sample duplicate rate over (deduplicated) sequencing coverage. c) Genotype quality (GQ) as a function of mean GQ (averaged over 2 × 12 samples). Fraction of variant calls that overlap between replicates (d), and their genotype concordance (e) at either all variants (GQ > 0) or high-confidence variants (GQ > 20). f) Recall, g) Precision, and h) Non-reference concordance rates (NCR) computed per sample against the 1000 Genomes high-coverage call set [32] as “truth”. The single HP4 sample with coverage > 10x was excluded from this comparison.

Figure S2. a) Overview of data types available for participants and how they overlap. b) Distribution of de-duplicated sequencing coverage per sample for low-pass samples, c) TruSeq PCR-free high-coverage samples, d) TruSeq Nano high-coverage samples. e) Distribution of sequencing duplicate rates per sample for low-pass samples, f) for TruSeq PCR-free samples, and g) for TruSeq Nano samples. h) Breakdown of number of individuals by self-reported ethnicity and sequencing type.

Figure S3. Effect of GQ filtering on indel calling performance. a) Recall, b) Precision, and c) NCR for indels over varying GQ thresholds.

Figure S4. Accuracy of flagged sites by flag type. a) Overview of the different flag types that characterize variants by comparing the (filtered) sequencing-based genotype with the genotype after imputation. A call is flagged with IM = 0 if the sequencing-based genotype and the imputed genotype agree fully. Given low coverage, we consider the lack of sequencing evidence for an imputed allele as “not inconsistent”, while the disappearance of an allele after imputation is categorized as “inconsistent”. IM = 1 therefore flags imputed calls that are not inconsistent with the sequencing-based call (either because it was missing or because we may have observed only one of two alleles in sequencing). IM = 2 and IM = 3 flag sites that are inconsistent between sequencing-based and imputed calls, where IM = 2 calls were heterozygous in sequencing (potentially due to sequencing or mapping artifacts, contamination, or an error in imputation) and IM = 3 calls were homozygous for the opposite allele. b) Fraction of SNV calls in each IM flag category. c) Fraction of indel calls in each IM flag category. d) Recall (normalized to each individual’s overall SNV recall), e) Precision, and f) NCR of SNVs. g) Normalized recall, h) Precision, and i) NCR of indels.

Figure S5. Detailed performance (recall, precision, and NCR) of SNV and indel calling both genomewide (including repetitive regions) and in high-confidence regions only, shown over coverage. a) SNVs genomewide, b) SNVs in high-confidence regions, c) Indels genomewide, d) Indels in high-confidence regions.

Figure S6. Performance comparison across different pipeline stages/runs. a) Overview of tested call sets. “Single” refers to individually called mid-pass data (GQ > 17). “MP” and “MP-HP” refer to the joint-called (“joint”) and imputed (“imp”) call sets using mid-pass data from 1510 individuals (MP) and mid-pass data from 1410 individuals plus high-pass data from 100 high-pass individuals (MP-HP). For more details, see Methods. b) Recall, c) Precision, and d) NCR for SNVs. e) Recall, f) Precision, and g) NCR for indels.

Figure S7. Analysis of European admixture in the study cohort. ADMIXTURE was run assuming two populations on the cohort, with 91 British individuals from 1000 Genomes (GBR) included to capture European ancestry. Shown are the proportions of ancestry estimated (population 1 = red, population 2 = orange). Individuals are ordered by cohort (GBR/Polynesian). Analysis of PC1 from PC analysis versus proportion of population 1 ancestry from ADMIXTURE analysis found that PC1 is highly correlated with the degree of estimated European ancestry (Spearman’s ρ = −0.89, p < 2.2e-16).

Figure S8. Principal component (PC) analysis of imputed genotype calls. a–i) PC1 vs PC2–10, labeled based on self-reported ethnicity. j) PC5 vs log(coverage) with data points colored by sequencing depth and symbols corresponding to library type (plexWell LP384 used for low-pass sequencing and TruSeq PCR-free used for high-pass sequencing). Individuals with ≥1.5x coverage and both SNVs and indels in high-confidence regions of the genome were used for the analyses.

Figure S9. Effect of cohort size on performance. a) PCA of self-reported Aotearoa New Zealand Māori individuals that were included in the analysis. b) Sequencing type breakdown within subcohorts (MP, mid-pass; HP, high-pass). c) Recall, d) Precision, and e) NCR for SNVs. f) Recall, g) Precision, and h) NCR for indels.

Figure S10. MAF-based comparison of variants in high-confidence regions, split by coverage level. a) Recall, b) Precision, and c) NCR of SNVs over the full MAF range. Panels d), e), and f) show the same plots zoomed in on the 0–7.5% MAF range. Panels g) to l) show the same for indels.

Figure S11. Allele frequency distribution of common (MAF > 5%) variants in the study cohort that are either absent from (a) or rare in (b) 1000 Genomes. Indels located in high-confidence regions of the genome and all SNVs were included in the analysis.
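
The IM flag categories described for Figure S4 can be read as a small classification rule. The sketch below is one possible interpretation, not the authors' implementation: it assumes biallelic sites, encodes each genotype as an alternate-allele count (0, 1, or 2), uses `None` for a missing sequencing-based call, and assumes the imputed genotype is always present. The function name `im_flag` and this encoding are hypothetical and are not taken from the additional file.

```python
from typing import Optional


def im_flag(seq_gt: Optional[int], imp_gt: int) -> int:
    """Classify a site by comparing sequencing-based and imputed genotypes.

    Genotypes are alternate-allele counts at a biallelic site (0, 1, or 2).
    seq_gt is None when the sequencing-based call is missing (assumed encoding).
    """
    if seq_gt is None:
        # No sequencing-based call: the imputed call cannot contradict it.
        return 1
    if seq_gt == imp_gt:
        # Sequencing-based and imputed genotypes agree fully.
        return 0
    if seq_gt == 1:
        # Heterozygous in sequencing, homozygous after imputation:
        # an observed allele disappeared, so the calls are inconsistent.
        return 2
    if imp_gt == 1:
        # Homozygous in sequencing, heterozygous after imputation:
        # only one of the two alleles may have been observed at low coverage.
        return 1
    # Homozygous in sequencing and homozygous for the opposite allele
    # after imputation.
    return 3


if __name__ == "__main__":
    for seq, imp in [(None, 1), (0, 0), (0, 1), (1, 2), (2, 0)]:
        print(seq, imp, im_flag(seq, imp))
```

Under this reading, only a heterozygous or opposite-homozygous sequencing call can make a site "inconsistent", which matches the legend's treatment of missing evidence at low coverage as "not inconsistent".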

Funding

Health Research Council of New Zealand
Lottery Health Research

History