RFC1 STR expansions have recently been linked to CANVAS [Cortese 2019].
This STR locus is unique among known pathogenic loci in that it's both autosomal recessive and has a pathogenic repeat motif (AAGGG)
that differs from the motif in the reference genome (AAAAG). Several other benign and pathogenic motifs have also been reported at this locus:
AAAGG is benign [Cortese 2019], ACAGG is pathogenic [Scriba 2020],
while AAGAG and AGAGG have uncertain significance [Akcimen 2019].
Most individuals have the same motif on both chromosomes, while many carry one motif on one chromosome and another motif on the other chromosome. To diagnose cases of CANVAS using WGS data a tool needs to distinguish between these scenarios since only biallelic expansions of the AAGGG or ACAGG motifs are associated with disease.
Existing STR genotyping tools such as ExpansionHunter and GangSTR require users to pre-specify the motif, and so do not work well for loci where the motif can differ among individuals. Other tools such as ExpansionHunterDenovo do not have this limitation, but are unable to distinguish unaffected carriers (such as those that have one benign AAAAG reference allele, and one pathogenic AAGGG allele) from affected individuals (homozygous for the AAGGG allele). In both cases, ExpansionHunter Denovo might detect the AAGGG allele, but would fail to detect the benign allele unless it is also highly expanded, and so would not be useful for distinguishing heterozygous AAGGG carriers from the much rarer instances of homozygous AAGGG genotypes.
As a temporary solution, we addressed these limitations by developing call_non_ref_pathogenic_motifs.py. For the RFC1 locus, the script first detects the one or two motifs present in an individual and then runs ExpansionHunter for each motif to estimate its allele size as well as REViewer to generate read visualizations. The approach used by this script to detect motifs is, coincidentally, a simpler version of the approach used by the recently-released STRling tool. Unbiased comparisons are difficult given that we have a small number of positive controls and designed the script based on the positive controls we had available. However, we found that the script had slightly better sensitivity than STRling on these positive RFC1 controls while maintaining high specificity. Additionally, it allowed us to generate read visualizations in a way not currently possible with other tools. A more detailed description of the script is available on GitHub.
We also plan to add more detailed tool comparisons to this blog in the near future.
Limitations:
The approach used by call_non_ref_pathogenic motifs will likely miss more complex expansions where a single haplotype has multiple motifs - such as the disease-associated (AAAGG)10-25(AAGGG)exp repeats described in [Beecroft 2020].