General population allele frequencies – such as those made available by ExAC and gnomAD – are invaluable for variant interpretation. As such, Invitae has developed an approach for evaluating population data that is more sophisticated than simply comparing allele frequencies against a single threshold.
How does Invitae calculate allele frequency values?
Typically, the evaluation of population data involves a very simple allele frequency (AF)* calculation for a variant: the total variant count divided by the total number of chromosomes sequenced.
However, this approach does not work well when comparing allele frequencies derived from cohorts of different sizes, which are common in gnomAD and ExAC. For example, a variant in an intronic or promoter region may be covered in only a few thousand individuals, while a variant in an exonic region may be covered in a few hundred thousand individuals. Even if those two variants have the same raw allele frequency, the precision of the two values is vastly different. To account for this, we assess population frequency using the 95% confidence value of the raw allele frequency. We use a statistical model, the beta distribution, which allows us to say, “we are 95% confident the allele frequency of this variant is at least X%”. These beta-distribution-derived values are what we use to assess variants.
For illustrative purposes, here are gnomAD data for two BRCA1 variants. Both variants have a raw allele frequency of approximately 0.1%. However, because the sample size for the second variant is much smaller, our confidence in its allele frequency is much lower, and its 95% confidence value is correspondingly lower.
Source | # of variants | # of chromosomes sequenced | Raw allele frequency | I am 95% confident that the allele frequency is at least... |
BRCA1 NM_007294.3:c.148G>A (rs28897677) | ||||
gnomAD (non-Finnish Europeans) | 114 | 128956 | 0.09% | 0.076% |
BRCA1 NM_007294.3:c.1745C>T (rs786202386) | ||||
gnomAD (other) | 1 | 1084 | 0.09% | 0.032% |
For more on beta distributions, see the Wikipedia article on the beta distribution.
In Excel, the 95% confidence value can be calculated with the beta-distribution function BETA.INV(prob, A, B), where prob is set to 0.05, A is the number of variants plus one, and B is the number of chromosomes sequenced minus the number of variants, plus one.
*AF = total variant count / total # of chromosomes sequenced
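For readers working outside Excel, the BETA.INV calculation described above can be approximated with a short script. The following is a minimal Python sketch, not Invitae's implementation. The function names beta_cdf_int and lower_confidence_af are made up for illustration. It assumes A and B are whole numbers, in which case the beta cumulative distribution equals a binomial tail probability, so only the standard library is needed. The inverse is found by bisection.

```python
import math


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a).
    """
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    n = a + b - 1
    log_x, log_1mx = math.log(x), math.log1p(-x)
    lower_tail = 0.0
    for k in range(a):  # accumulate P(Binomial(n, x) <= a - 1)
        log_term = (
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * log_x + (n - k) * log_1mx
        )
        lower_tail += math.exp(log_term)
    return 1.0 - lower_tail


def lower_confidence_af(variant_count, chromosomes, prob=0.05):
    """Illustrative equivalent of BETA.INV(prob, A, B) with
    A = variant_count + 1 and B = chromosomes - variant_count + 1."""
    a = variant_count + 1
    b = chromosomes - variant_count + 1
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: the CDF increases with x
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, a, b) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    # The two gnomAD rows from the table above: (variants, chromosomes)
    for count, chromosomes in [(114, 128956), (1, 1084)]:
        bound = lower_confidence_af(count, chromosomes)
        print(f"{count}/{chromosomes}: raw AF {count / chromosomes:.3%}, "
              f"95% confidence value {bound:.3%}")
```

Run on the two rows of the table above, the results should come out close to the tabulated values. Small differences in the last digit can arise from rounding.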
What allele frequency thresholds does Invitae use?
The American College of Medical Genetics (ACMG) guidelines recommend that when “(an) allele frequency is greater than expected for a disorder,” it should be considered strong evidence for a benign classification (PMID: 25741868). Rather than draw arbitrary thresholds, we empirically derived the appropriate thresholds using the allele frequencies of known pathogenic variants, as described previously in PMID: 28166811.
Based on this method, we derived three thresholds:
Very high: In the absence of evidence supporting a pathogenic classification, variants at this threshold are classified as Benign. This was empirically calculated to be an allele frequency value greater than approximately 99.9% of all known pathogenic variants.
High: In the absence of evidence supporting a pathogenic classification, variants at this threshold are classified as Likely Benign. This was empirically calculated to be an allele frequency value greater than approximately 99.7% of all known pathogenic variants.
Somewhat high: An allele frequency range that suggests the variant is benign, although the variant will remain a variant of uncertain significance (VUS) in the absence of additional supporting evidence. This was empirically calculated to be an allele frequency value greater than approximately 95% of all known pathogenic variants.
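The idea behind the three definitions above, picking a frequency that exceeds a given share of known pathogenic variants, can be sketched as a simple percentile calculation. This is only an illustration of the general idea, not the published method in PMID: 28166811. The function name percentile_threshold and the nearest-rank rule are assumptions, and the input list below is placeholder data, not real allele frequencies.

```python
import math


def percentile_threshold(pathogenic_afs, fraction):
    """Nearest-rank percentile: the smallest observed allele frequency that is
    greater than or equal to the given fraction of the input values."""
    ordered = sorted(pathogenic_afs)
    # round() guards against floating-point noise before taking the ceiling
    rank = math.ceil(round(fraction * len(ordered), 9))
    rank = min(max(rank, 1), len(ordered))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Placeholder values only; a real analysis would use the 95% confidence
    # values of curated pathogenic variants.
    placeholder_afs = [i / 1_000_000 for i in range(1, 1001)]
    for label, fraction in [("Very high", 0.999), ("High", 0.997), ("Somewhat high", 0.95)]:
        print(label, percentile_threshold(placeholder_afs, fraction))
```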
Finally, because pathogenic variants tend to occur at higher allele frequencies in recessive conditions than in dominant conditions, we calculated these thresholds separately for each. They are as follows:
Allele frequency thresholds (based on the 95% confidence value)
Condition type | Very high | High | Somewhat high |
Dominant | 0.261% | 0.052% | 0.020% |
Recessive | 0.523% | 0.157% | 0.038% |
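As a rough illustration of how the table above could be applied, the sketch below maps a 95% confidence value (in percent) to a threshold tier. The names THRESHOLDS_PERCENT and frequency_tier are hypothetical. Treating a value exactly equal to a cutoff as meeting it is an assumption, since the text does not specify this. The tier alone does not determine a classification, because the text above conditions it on the absence of evidence supporting pathogenicity.

```python
# Cutoffs copied from the table above, in percent, ordered from strictest to loosest.
THRESHOLDS_PERCENT = {
    "dominant": [("Very high", 0.261), ("High", 0.052), ("Somewhat high", 0.020)],
    "recessive": [("Very high", 0.523), ("High", 0.157), ("Somewhat high", 0.038)],
}


def frequency_tier(confidence_value_percent, condition_type):
    """Return the highest tier whose cutoff is met, or None if no cutoff is met.

    Assumption: a value equal to a cutoff counts as meeting it.
    """
    for tier, cutoff in THRESHOLDS_PERCENT[condition_type]:
        if confidence_value_percent >= cutoff:
            return tier
    return None


if __name__ == "__main__":
    # Illustrative input values only
    for value in (0.6, 0.1, 0.03, 0.01):
        print(value, frequency_tier(value, "dominant"), frequency_tier(value, "recessive"))
```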