Most existing tools for genotyping tandem repeats (TRs) from short-read or long-read sequencing data require the user to provide the exact start/end coordinates and motifs of all TR loci to genotype. Due to this, a key step in any genome-wide TR analysis is to define the loci to include (aka. the TR catalog). This choice directly affects sensitivity since any loci not in the catalog will be missing from the downstream analysis. Also, it determines compute costs since runtimes of most current tools are proportional to the number of loci being genotyped. There is no consensus in the field on how to choose a catalog, with recent papers and presentations all using very different sets of loci as their starting points.
Catalogs of TR loci can differ in multiple ways, including:
catalog | total loci | chrX | chrY | homo- polymers? |
motif sizes |
locus sizes |
# of repeats | pure repeats
|
trimmed
|
overlapping loci
|
---|---|---|---|---|---|---|---|---|---|---|
Known disease-associated loci | 72 | 8 (11.3%) | 0 (0.0%) | 1-27bp | 12-150bp | 1-50x | 58.3% | 100.0% | 0 (0.0%) | |
TRF catalog: only pure repeats + homopolymers spanning ≥ 6bp in hg38 |
9,223,577 | 489,574 (5.3%) | 0 (0.0%) | 1-50bp | 6-1597bp | 3-300x | 100.0% | 78.0% | 0 (0.0%) | |
GangSTR v17 | 1,340,266 | 65,719 (4.9%) | 8,979 (0.7%) | 1-20bp | 10-612bp | 3-248x | 100.0% | 100.0% | 16 (0.0%) | |
TRF catalog: pure & interrupted repeats spanning ≥ 6bp in hg38 |
4,596,675 | 244,906 (5.3%) | 0 (0.0%) | 2-50bp | 6-6800bp | 3-1091x | 85.1% | 100.0% | 84,822 (1.8%) | |
HipSTR | 1,634,957 | 79,233 (4.8%) | 12,638 (0.8%) | 1-6bp | 11-102545bp | 3-20509x | 46.6% | 100.0% | 0 (0.0%) | |
TRGT | 171,146 | 8,558 (5.0%) | 0 (0.0%) | 2-24bp | 4-532bp | 2-133x | 95.5% | 100.0% | 0 (0.0%) | |
Illumina | 174,285 | 8,561 (4.9%) | 0 (0.0%) | 2-24bp | 4-532bp | 2-133x | 95.5% | 100.0% | 0 (0.0%) | |
TR truth set from 51 HPRC samples |
1,573,403 | 70,885 (4.5%) | 4,806 (0.3%) | 1-50bp | 1-1591bp | 1-301x | 94.7% | 100.0% | 131,515 (8.4%) | |
popSTR | 5,401,399 | 0 (0.0%) | 0 (0.0%) | 1-6bp | 11-137bp | 2-128x | 78.0% | 42.0% | 11,202 (0.2%) | |
Adotto v1.2 | 2,934,732 | 109,740 (3.7%) | 16,585 (0.6%) | 2-495bp | 4-49816bp | 1-8836x | 41.6% | 32.0% | 26,198 (0.9%) | |
Platinum TRs v1.0 | 7,722,729 | 328,981 (5.2%) | 47,192 (0.7%) | 1-822bp | 4-9545bp | 1-2659x | 62.2% | 36.1% | 0 (0.0%) |
catalog | total loci | chrX | chrY | homo- polymers? |
motif sizes |
locus sizes |
# of repeats | pure repeats
|
trimmed
|
overlapping loci
|
---|---|---|---|---|---|---|---|---|---|---|
Known disease-associated loci | 72 | 8 (11.3%) | 0 (0.0%) | 1-27bp | 12-150bp | 1-50x | 58.3% | 100.0% | 0 (0.0%) |
## download the publicly-available catalogs from the google buckets and repos from [weisburd 2023] known_disease_associated_loci=https://raw.githubusercontent.com/broadinstitute/str-analysis/main/str_analysis/variant_catalogs/variant_catalog_without_offtargets.GRCh38.json trf_pure_6bp_catalog=https://storage.cloud.google.com/str-truth-set/hg38/ref/other/repeat_specs_GRCh38_without_mismatches.including_homopolymers.sorted.at_least_6bp.bed.gz trf_pure_9bp_catalog=https://storage.cloud.google.com/str-truth-set/hg38/ref/other/repeat_specs_GRCh38_without_mismatches.including_homopolymers.sorted.at_least_9bp.bed.gz gangstr_catalog=https://storage.cloud.google.com/str-truth-set/hg38/ref/other/hg38_ver17.adjusted.bed.gz trgt_catalog=https://storage.cloud.google.com/str-truth-set/hg38/ref/other/trgt_repeat_catalog.hg38.reformatted_to_motif_only.bed.gz illumina_catalog=https://storage.cloud.google.com/str-truth-set/hg38/ref/other/illumina_variant_catalog.sorted.bed.gz truth_set_catalog=https://storage.cloud.google.com/str-truth-set-v2/results_with_homopolymers/results_from_combined/combined.51_samples.positive_loci.json hipstr_catalog=https://storage.cloud.google.com/str-truth-set/hg38/ref/other/hg38.hipstr_reference.adjusted.bed.gz trf_6bp_catalog_with_mismatches=https://storage.cloud.google.com/str-truth-set/hg38/ref/other/repeat_specs_GRCh38_allowing_mismatches.sorted.trimmed.at_least_6bp.bed.gz trf_9bp_catalog_with_mismatches=https://storage.cloud.google.com/str-truth-set/hg38/ref/other/repeat_specs_GRCh38_allowing_mismatches.sorted.trimmed.at_least_9bp.bed.gz popstr_catalog=https://storage.cloud.google.com/str-truth-set/hg38/ref/other/popstr_catalog_v2.bed.gz adotto_catalog=https://storage.cloud.google.com/str-truth-set/hg38/ref/other/adotto_tr_catalog_v1.2.bed.gz ref_fasta=/local/path/hg38.fa output_prefix=combined_catalog.trf_at_least_9bp ## annotate each catalog and discard loci with N in the motif for path in ${known_disease_associated_loci} \ ${trf_pure_9bp_catalog} \ ${gangstr_catalog} \ ${trf_9bp_catalog_with_mismatches} \ ${hipstr_catalog} \ ${trgt_catalog} \ ${illumina_catalog} \ ${truth_set_catalog} \ ${popstr_catalog} \ ${adotto_catalog}; do # download each catalog wget $path # annotate loci and discard edge cases like motif=N python3 -u -m str_analysis.annotate_and_filter_str_catalog \ -r ${ref_fasta} \ --output-stats \ --output-tsv \ --verbose \ --discard-loci-with-non-acgt-bases \ ${path} done ## merge all catalogs into a single file (excluding the popSTR and Adotto catalogs because they are unusual and/or suboptimal in some ways) python3 -u -m str_analysis.combine_str_catalogs --verbose \ --merge-adjacent-loci-with-same-motif \ --output-prefix ${output_prefix} \ --add-source-field \ $(basename ${known_disease_associated_loci} | sed 's/.json/.annotated_and_filtered.json.gz/') \ $(basename ${trf_pure_9bp_catalog} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/') \ $(basename ${gangstr_catalog} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/') \ $(basename ${trf_9bp_catalog_with_mismatches} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/') \ $(basename ${hipstr_catalog} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/') \ $(basename ${trgt_catalog} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/') \ $(basename ${illumina_catalog} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/') \ $(basename ${truth_set_catalog} | sed 's/.json/.annotated_and_filtered.json.gz/') ## add adjacent loci to the catalog python3 -u -m str_analysis.add_adjacent_loci_to_expansion_hunter_catalog \ --ref-fasta ${ref_fasta} \ --max-distance-between-adjacent-repeats 6 \ --max-adjacent-repeats 2 \ --source-of-adjacent-loci ~/code/str-truth-set/ref/other/repeat_specs_GRCh38_without_mismatches.including_homopolymers.sorted.at_least_6bp.bed.gz \ ${output_prefix}.json.gz ## annotate the combined catalog python3 -u -m str_analysis.annotate_and_filter_str_catalog \ -r ${ref_fasta} \ --output-stats \ --output-tsv \ --verbose \ --discard-loci-with-non-acgt-bases \ ${output_prefix}.with_adjacent_loci.json.gz ## compute catalog stats python3 -u -m str_analysis.compute_catalog_stats --verbose \ $(basename ${known_disease_associated_loci} | sed 's/.json/.annotated_and_filtered.json.gz/') \ $(basename ${trf_pure_6bp_catalog} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/') \ $(basename ${gangstr_catalog} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/') \ $(basename ${popstr_catalog} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/') \ $(basename ${trf_6bp_catalog_with_mismatches} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/') \ $(basename ${hipstr_catalog} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/') \ $(basename ${trgt_catalog} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/') \ $(basename ${illumina_catalog} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/') \ $(basename ${truth_set_catalog} | sed 's/.json/.annotated_and_filtered.json.gz/') \ $(basename ${adotto_catalog} | sed 's/.bed.gz/.annotated_and_filtered.json.gz/')Running these steps yields a combined catalog with the following properties:
Stats for combined_catalog.trf_at_least_9bp.with_adjacent_loci.annotated_and_filtered.json.gz: 4,918,783 total loci 1,139,440 out of 4,918,783 ( 23.2%) loci have adjacent repeats 6,196,806 total repeat intervals 4,522,377 out of 6,196,806 ( 73.0%) repeat interval size is an integer multiple of the motif size (aka. trimmed) 1,792,664 out of 6,196,806 ( 28.9%) repeat intervals are homopolymers 1,534,528 out of 6,196,806 ( 24.8%) repeat intervals overlap each other by at least two motif lengths Examples of overlapping repeats: chr19:22862456-22862510, chr16:87038648-87038702, chr6:131131459-131131467, chr3:75804442-75804495, chrX:53113326-53113335, chrX:141112874-141112888, chr14:36902984-36902999, chr12:6740055-6740065, chr16:7165935-7165979, chr15:91057135-91057191 Ranges: Motif size range: 1-50bp Locus size range: 1-102545bp Num repeats range: 1-20509x repeats Maximum locus size = 102545bp @ chrY:11490351-11592896 (AATGG) Minimum fraction pure bases = 0.00 @ chrY:26670808-26670828 (AATGG) Minimum fraction pure repeats = 0.00 @ chrX:155370990-155371126 (TATATAACTATAAATAAGTATATATTTATATATT) Minimum overall mappability = 0.00 @ chrY:24489213-24489225 (GCCT) chrX: 322,966 out of 6,196,806 ( 5.2%) repeat intervals chrY: 16,670 out of 6,196,806 ( 0.3%) repeat intervals chrM: 15 out of 6,196,806 ( 0.0%) repeat intervals alt contigs: 0 out of 6,196,806 ( 0.0%) repeat intervals Motif size distribution: 1bp: 1,792,664 out of 6,196,806 ( 28.9%) repeat intervals 2bp: 1,479,000 out of 6,196,806 ( 23.9%) repeat intervals 3bp: 1,532,929 out of 6,196,806 ( 24.7%) repeat intervals 4bp: 839,986 out of 6,196,806 ( 13.6%) repeat intervals 5bp: 256,601 out of 6,196,806 ( 4.1%) repeat intervals 6bp: 126,362 out of 6,196,806 ( 2.0%) repeat intervals 7-24bp: 120,964 out of 6,196,806 ( 2.0%) repeat intervals 25+bp: 48,300 out of 6,196,806 ( 0.8%) repeat intervals Fraction pure bases distribution: 0.0: 8,522 out of 6,196,806 ( 0.1%) repeat intervals 0.1: 8,595 out of 6,196,806 ( 0.1%) repeat intervals 0.2: 16,730 out of 6,196,806 ( 0.3%) repeat intervals 0.3: 16,591 out of 6,196,806 ( 0.3%) repeat intervals 0.4: 15,682 out of 6,196,806 ( 0.3%) repeat intervals 0.5: 36,774 out of 6,196,806 ( 0.6%) repeat intervals 0.6: 58,015 out of 6,196,806 ( 0.9%) repeat intervals 0.7: 82,843 out of 6,196,806 ( 1.3%) repeat intervals 0.8: 141,233 out of 6,196,806 ( 2.3%) repeat intervals 0.9: 71,488 out of 6,196,806 ( 1.2%) repeat intervals 1.0: 5,740,333 out of 6,196,806 ( 92.6%) repeat intervals Mappability distribution: 0.0: 30,591 out of 4,918,783 ( 0.6%) loci 0.1: 21,219 out of 4,918,783 ( 0.4%) loci 0.2: 27,035 out of 4,918,783 ( 0.5%) loci 0.3: 29,659 out of 4,918,783 ( 0.6%) loci 0.4: 38,958 out of 4,918,783 ( 0.8%) loci 0.5: 68,023 out of 4,918,783 ( 1.4%) loci 0.6: 51,683 out of 4,918,783 ( 1.1%) loci 0.7: 70,885 out of 4,918,783 ( 1.4%) loci 0.8: 110,040 out of 4,918,783 ( 2.2%) loci 0.9: 289,968 out of 4,918,783 ( 5.9%) loci 1.0: 4,180,722 out of 4,918,783 ( 85.0%) loci
[Willems 2014], one of the first studies to perform a genome-wide TR analysis using Illumina short read sequencing data, put considerable effort into defining an appropriate catalog, as described in their Results section and Supplementary Materials which are copy-pasted below. However, their approach was limited to the TR alleles observed within the hg19 reference genome. Also, they used simulations to calculate thresholds that distinguish biologically meaningful STRs from random or accidental repetative sequences, but this insilico approach is relatively blunt - just because a sequence could occur randomly with some probability doesn't mean it's not biologically important if it occurs invivo.
Also, data has since become available from the Human Pangenome Research Consortium (HPRC) as well as from other studies such as T2T-YAO from [He 2023] and the Illumina catalog by Qiu, Dolzhenko, et al. that now allows us to directly / empirically identify polymorphic TR loci and study their properties.
Relevant sections from the Results and Supplementary Materials of
Willems T, Gymrek M, Highnam G; 1000 Genomes Project Consortium, Mittelman D, Erlich Y. The landscape of human STR variation. Genome Res. 2014: