Identification of gap signals using gap hunting
We developed a simple, computationally fast algorithm, called “gap hunting,” to flag probes on the 450k exhibiting distributions of percent DNAm that cluster into discrete groups. We applied this to DNAm data from 590 whole-blood-derived samples from the Study to Explore Early Development I (SEED I) at various combinations of user-supplied arguments to this procedure termed “threshold” and “outCutoff” (Additional file 1: Figure S1, see “Methods” section for a detailed description of the approach and these arguments). Of the 473,864 autosomal probes we measured in SEED I participants on the 450k, we identified 11,007 (2.3%) with clustered distributions of DNAm values which we term “gap signals.” These results were generated using a “threshold” value of 0.05 and an “outCutoff” value of 0.01; the following analyses were all conducted considering these arguments and this list of gap signals. The vast majority of gap signals were composed of 2 or 3 clusters of DNAm values (Additional files 2 and 3: Figure S2 and Table S1). For example, the distribution of percent DNAm for cg01802772 clusters into 3 distinct groups (Fig. 1, top panel). Using genotyping data, available from the same SEED individuals, we found that these 3 methylation clusters correspond to genotype for SNP rs299872; this SNP is located at the interrogated C site (Fig. 1, top panel). For this particular probe, we also queried the dbSNP138 database and found that a C/T SNP is annotated as overlapping the interrogated C site (Fig. 1, bottom panel).
Based on our initial gap signal observations, we decided to perform an in-depth analysis of all 11,007 gap signals to characterize the underlying source of these DNAm distributions. Using paired genotype (GWAS) and methylation data, we found that 5453 gap probes (49.5%) contain a SNP from our SEED GWAS dataset, and thus direct evidence for SNP influence. Of the remaining gap probes, 3746 (34.0%) have an (dbSNP13) annotated SNP, in-del, microsatellite, or multi-nucleotide polymorphism or map to a University of California, Santa Cruz (UCSC)-annotated repeat (and thus have indirect evidence for SNP/variant influence), and 1808 gap probes (16.4%) did not contain a SNP from our SEED GWAS dataset and also were not annotated as containing a variant from the dbSNP138 database or a UCSC repeat. Given the large proportion of SNP-associated gap probes, we first sought to examine the role of SNPs in producing gap signals. Our approach to understanding the role of SNPs in producing gap signals consisted of two main elements. First, we theoretically conceptualized how various types of SNPs located at different locations in the probe, including the measured C and corresponding G loci, the SBE site, as well as elsewhere in the probe would affect 450k signal based on our knowledge of the measurement chemistry. Second, we performed empirical analyses using our joint GWAS and 450k DNAm data from SEED. We also examined the remaining ~16% of gap probes that do not have an associated SNP or variant, according to the SEED GWAS data and reference annotations.
Predicted influence of SNPs on 450k DNAm signals
Based on the underlying 450k probe chemistry, we predicted how SNPs influence 450k signal. Our predictions are summarized in Fig. 2. We first predicted the influence on signal of nucleotide changes for SNPs that overlap the C nucleotide of the measured locus. For Type I forward strand probes containing a T/C SNP at the interrogated C site, we predict the no signal in the methylated channel and signal in the unmethylated channel. Thus, the signal readout would be the same as for an unmethylated CpG state. For all other possible SNPs, including A/C and C/G, we would expect no signal to be reported by either the methylated or unmethylated channels for Type I forward strand probes, resulting in no overall signal; these are likely to be detected as failed probes. We predict Type II forward strand probes containing a T/C or A/C SNP at the interrogated C site to result in cyanin5 (Cy5) signal, thus, mimicking an unmethylated state. Type II forward strand probes containing a C/G SNP are predicted to result in cyanin3 (Cy3) signal, thus, mimicking a methylated state. For all reverse strand probes, including Type I and II, any SNP at the interrogated C position results in no signal from either channel, and these probes are also likely to be detected as failures.
Next we predicted the influence of nucleotide changes for SNPs that overlap the G nucleotide of the measured CpG site on signal detection (Fig. 2). For Type I reverse strand probes, we predict an A/G SNP will result in a signal in the unmethylated channel and no signal in the methylated channel, resulting in a similar methylation readout as an unmethylated cytosine nucleotide. Other SNPs, including C/G and T/G, are predicted to result in no detectable signal in either the methylated or unmethylated channels. For Type II reverse strand probes, we would expect a C/G SNP present at the interrogated G site to result in a green signal, the readout for probes with these SNPs thus matches the readout for a methylated state. The presence of an A/G or T/G SNP at the G nucleotide position should result in detection of a red signal; the readout from probes with these SNPs would match the readout for an unmethylated states. Forward strand probes with any type of G-site SNP are not predicted to mask methylation states, but instead they should produce no overall signal.
Finally, we predict the influence of SNPs that overlap the SBE site on signal. For Type II probes, the SBE site overlaps with the interrogated C site; therefore, the influence of SNPs is the same as for C site SNPs, as described above and shown in Fig. 2. In the Type I probe design, the SBE site is immediately adjacent to the interrogated C site; it is one base upstream of forward strand probes and one base downstream for reverse strand probes. The Illumina detection software is programmed to read a pre-defined color channel, which is based on the nucleotide that is expected to be incorporated (defined a priori using a reference sequence). For example, if the base upstream of the interrogated C site is defined as an A nucleotide in the reference sequence, the detection software will only detect signal in the red channel and will not query for a signal in the green channel. Therefore, any SBE-associated SNPs will result in a loss of signal when the incorporated nucleotide is tagged in the opposite color to that dictated by the reference sequence. As shown in Fig. 3, C/G, A/G and T/G genotypes at an SBE-associated SNP will result in loss of signal on forward strand probes. Note that the fluorescence still occurs upon SBE, but the software does not read the signal because it is in a different color channel than what is expected, based on the pre-defined reference sequence. Similarly, C/G, A/C, and T/C SNPs at SBE sites for reverse strand probes are predicted to result in loss of signal (Fig. 3). Several SBE-associated SNPs are also predicted to have no impact on the methylation readout. These include T/A, A/C, and T/C variants for forward strand probes and T/A, A/G, and T/G variants for reverse strand probes (Fig. 3).
SNPs can also occur elsewhere in the probe length; however, it is less straightforward to develop theoretical rules or principles guiding how these may affect probe signal. Similarly, it is unclear how probes with multiple SNPs may behave with respect to methylation signal. Therefore, we do not provide a theoretical framework for these types of probes, but instead provide the results from our empirical analyses below.
Empirical evidence shows SNPs at the interrogated CpG site are related to gap signals
We performed empirical analyses to determine the relationship between DNAm levels reported by 450k and SNPs present at the CpG site using our unified SEED DNAm and genotyping data and compared them to our theoretical expectations, shown in Fig. 2. We identified all of the 450k probes with a measured or imputed SNP present at the interrogated C or corresponding G loci in our SEED sample (n = 5129) (Additional file 4: Table S2). To ensure we were only assessing the influence of our SEED SNPs at CpG sites, we limited our analysis to include only probes with a SNP at the CpG site itself and not elsewhere in the probe length. We found that our empirical SNP results coincided with our predicted results for the SNP scenarios shown in Fig. 2 (Additional file 5: Figures S3–S25). For example, we observed a positive correlation between percent DNAm and dosage of the G allele across the set of 94 probes, including 23 Type I and 71 Type II probes, containing a C/G SNP at the interrogated C locus (Fig. 4a and Additional file 4: Table S2). This appears to be a direct consequence of the positive correlation with methylated probe signal and negative correlation with unmethylated probe signal (Fig. 4b,c). This observation coincides with our prediction for this scenario (Fig. 2) because the addition of the non-reference (G) nucleotide is expected to increase methylated (green) signal at the expense of unmethylated (red) signal. To better conceptualize the effect of a SNP on the total produced signal, i.e., combined methylated and unmethylated signals, we computed a copy number metric (see “Methods” section) and found, in general, it decreased with dosage of the G allele (Fig. 4d). However, the mean copy number metric of the heterozygous group does not lie exactly intermediate between the two homozygous groups, thus highlighting the importance of also considering the methylation state in the interpretation of SNP-influenced 450k probes. For example, in Fig. 1 (top panel), individuals with the “TT” genotype have low methylation values because of their low ratios of methylated to unmethylated intensities dictated by their T allele. Individuals containing one or two copies of the C allele at this SNP can have varying degrees of methylation. In the example shown in Fig. 1, the C alleles are completely methylated for all samples, resulting in discrete DNAm groups. If however, the C alleles were unmethylated, the groups would be largely indistinguishable and form one cluster instead of three. The lack of an explicitly intermediate mean in the heterozygous group for the copy number metric, then, is a consequence of the heterogeneity in methylation at the “C” base at these sites and heterogeneity amongst samples in their methylation state. Additional file 5: Figures S3–S25 contains plots for the remaining SNP scenarios delimited in Fig. 2, and all showed similar relationships.
Empirical evidence shows SBE site SNPs are related to gap signals
We identified all of the 450k probes in SEED with a measured or imputed SNP located at the SBE site (n = 118) (Additional file 4: Table S2). We specifically limited our analyses to probes that contained an SBE-associated SNP exclusively, i.e., there were no SNPs elsewhere in the probe. We found that, overall, our empirical results correspond to our predicted signal for the SNP scenarios shown in Fig. 3 (Additional file 6: Figures S26–S31). For example, we observed an inverse relationship between dosage of the T allele and overall signal across all probes (n = 2) that have a T/G SNP at the SBE site, where G is the a priori defined base at the SBE according to the genome reference sequence (Fig. 5). This observation coincides with our prediction for this scenario (Fig. 3) because a SNP changing the nucleotide at the SBE position from “G” (detected in the green channel) to “T” (detected in the red channel) should result in no signal because the software is programmed, a priori based on reference genome sequence, to report methylation solely as a function of the signal being generated in the green channel. Note that similar to CpG associated SNPs, the mean copy number metric of the heterozygous group does not lie exactly intermediate between the two homozygous groups. This is likely reflecting the heterogeneity in DNA methylation across CpG sites and samples. Overall, our findings using SEED measured and imputed genotypes are consistent with our predictions shown in Fig. 3. However, in certain cases the relationship is less clear. There are a number of potential explanations for nonlinear relationships. First, since the overall signal is a measure of both the ability to detect signal, which as we have shown above can be influenced by SBE site SNPs, and the actual methylation state itself, it is possible that deviations from the expected relationship are related to actual differences in DNAm. These DNAm influences may be exacerbated by the relatively small number of probes examined in each scenario shown in Fig. 3 (all scenarios have ≤17 probes and some scenarios have ≤7 probes; see Additional file 4: Table S2). It is also possible that these nonlinear genotype signal shifts could be related to uncertainty around imputed genotypes.
Probe SNPs up to 20 base pairs from the CpG site are associated with gap signals
Next we sought to specifically assess the relationship between the gap signals we detected via gap hunting and probe SNPs using our unified SEED GWAS and DNAm data. We identified all of the 450k probes in SEED with a measured or imputed SNP located in the probe, excluding those with SNPs at the SBE or CpG sites (n = 33,317). We limited our analysis to probes that contained a single SNP to determine the relationship between SNP distance to the interrogated C site and gap signal. The probes were binned by SNP distance to the interrogated C site, and samples were grouped by genotype: homozygous for the reference allele, heterozygous, and minor allele homozygous. When we plotted the signal intensities, for both the methylated and unmethylated channels, which represent the mean intensity for all probes with a SNP at that particular distance to C site, we observed differences in mean signal intensities, and inter-quartile ranges (25th–75th percentiles), between heterozygotes and homozygotes that were consistent with allelic dosage (Fig. 6). For the Type II probe design, mean intensity differences between the genotype groups are observed up to a SNP distance of about 7–8 base pairs (bp) from the interrogated C site. We also observed that these probe SNP-related differences in signal intensity are less pronounced in the methylated channel compared to the unmethylated channel, where differences in intensity can persist for up to an approximately 20 bp distance. Thus, the unmethylated signal channel appears to be less robust to probe SNPs. Type I probes exhibit a similar behavior, but appear to show greater differences in signal intensity with SNPs and across larger probe distances (Additional file 7: Figure S32). One explanation for this behavior could be that Type I probes are more susceptible to probe SNPs because they were designed under the assumption that the interrogated CpG site and any CpG sites throughout the remainder of the probe length have the same methylation state (Additional file 7: Figure S32).
SNP-affected probes do not always result in gap signals
Our analyses above focus on identifying potential sources of gap signals and show that SNPs can lead to gap signals. Therefore, we also wanted to determine whether probe-associated SNPs always lead to gap signals. We found that not all polymorphism-affected probes result in gap signals (Additional file 4: Table S2). There are 3 main classes of beta distributions in which a probe may be affected by a SNP, but not result in a “gap-like” distribution (Fig. 7). The first occurs when there is a correlation between percent methylation and genotype, but no discrete clusters are observed (Fig. 7a). The second occurs when there are outlier signals, i.e., samples. The gap hunting algorithm was designed to exclude probes from the gap signal list if they likely contained an outlier sample. As a result if the smallest group of samples driving SNP-related gaps is less than the proportion of samples determined by the “outCutoff” argument, these probes will not be flagged as gap signal probes. Figure 7b illustrates this point; it shows that at cg15013523, gap hunting would not identify the group with the “TT” genotype as a discrete cluster because it is comprised of a single sample. These types of probes could be identified as a gap signal if the option to retain “outlier-driven” probes is selected. Finally, beta distributions with an associated SNP in the probe may show no DNAm variability at the site or no correlation with genotype and, therefore, will not result in a “gap-like” distribution (Fig. 7c); this lack of clear genotype correlation was also observed by Daca-Roszak et al. [17] and referred to as a “cloud-like” distribution. Therefore, the potential for a polymorphism-affected probe to be classified as a gap signal is related both to the presence of discrete separation in groups, and to the overall methylation state at the site.
Approximately 16% of gap signals identified in SEED cannot be attributed to an underlying SNP
Finally, among all autosomal probes, we compared the standard deviation distribution between gap and non-gap probes, both with and without associated SNPs, to better characterize gap signals that could not be attributed to an underlying SNP. The 6 mutually exclusive classes of probes we examined include (1) non-gaps with measured or imputed SEED SNPs, (2) non-gaps with annotated variants or repeat elements, (3) non-gaps with no associated variants, (4) gaps with measured or imputed SEED SNPs, (5) gaps with annotated variants or repeat elements, and (6) gaps with no associated variants. As shown in Fig. 8, all non-gap probe distributions, including those with and without an associated SNP, are highly overlapping for both the Type I and Type II designs, suggesting that the majority of non-gap probes have no or low variability in DNAm values similar to the example in Fig. 7c. The gap probe distributions are distinct from the non-gap distributions and show interesting within-group differences (Fig. 8). The gap signals with reference database annotated SNPs exhibit a higher proportion of probes with larger standard deviations than those with SEED measured or imputed SNPs. This is likely due to higher minor allele frequencies of annotated SNPs, generally, compared to the minor allele frequencies of the SEED SNPs (Additional file 8: Figure S33). Another interesting feature of the overall standard deviations is the distinct curve of the 1808 gap signals that lacked any of a measured/imputed SNP, reference database annotated variant, or a UCSC repeat in the probe. As clearly seen in the Type II probe design, there is a high proportion of gap probes without an associated SNP or annotated variant/repeat at low standard deviation values, relative to gap probes containing SNPs (Fig. 8). We also show that gap probes without a SNP or annotated variant/repeat tend to have a higher proportion of 2-cluster probes than gap probes with a SNP (Additional file 9: Table S3).
We were interested in quantifying the degree to which other factors, aside from a measured/annotated SNP or annotated repeat element, could lead to gap-like behavior. For example, it is possible that some of these 2-cluster gap signals ambiguously mapped to sex chromosomes and clustered according to sex; however, we only observed sex-specific clusters in 6 (0.3%) of these probes. 161 of these 1808 probes in total were previously defined as ambiguously mapping [12], 14 were determined to have failed via detection p value, and 210 were found to discriminate blood cell types [18]. These 3 factors account for 379 (21.0%) unique probes of the 1808 gap signals not attributed to an underlying genetic variant.
Other methods to identify clustered DNAm distributions are not as robust as gap hunting
We assessed the potential for other methods to identify clustered DNAm distributions with respect to detection sensitivity and the specific types of sample clusters they identify. We tested these methods against a set of 5000 probes made up of gap signals (which functioned as positive controls) and 5000 probes which were not gap signals, had no measured, imputed, or annotated SNP, and had very low variability (and thus functioned as negative controls). First, using the beta values for these probes, we applied a Gaussian mixture model clustering algorithm, which selects the optimal number of clusters based on the Bayesian information criterion (BIC), and found that it had 100% sensitivity, but only 50% specificity, to distinguish between the gap and non-gap probes. Additionally, in cases where the mixture model predicted a gap signal to (correctly) have more than 1 cluster, it was only able to identify the correct number of clusters 43% of the time. We also examined the utility of the dip test, in which the null hypothesis is that the data are unimodal [19], and found the area under the receiver operating characteristic curve to be 0.73. These methods performed similarly using M-values, with a 100% sensitivity, 67% specificity and 41% correct determination of cluster number using the mixture model, and an area under the curve determined by dip test p values of 0.73. We were then interested in examining the performance of these methods at specific scenarios (Fig. 7) to which gap hunting was insensitive (Additional file 10: Table S4). The mixture model approach was unable to correctly identify that there were 3 relevant clusters in any of these 3 probes, while using the dip test we could only reject the null hypothesis of unimodality at cg14613402 (Fig. 7a); however, this would not be the case if we used a more stringent p value to account for multiple comparisons. Finally, these 2 alternative methods did identify probe cg01802272 (Fig. 1) as a multimodal (dip test p value ≈ 0) as well as having 3 discrete clusters, consistent with our gap hunting approach.
Gap hunting can be useful in addressing population stratification in epigenome-wide association studies
After gaining an understanding of gap signal properties, we were interested in highlighting the potential utility of gap hunting in EWA studies. A recent paper by Barfield et al. [20] demonstrated the ability of principal components (PCs) derived from probes annotated with 1000 Genomes-identified SNPs to correct for population stratification. This method functions under the principal that methylation at these sites will be enriched for genotype-influenced signal and thus serve as a suitable alternative to or surrogate for gold standard correction via genotype-derived PCs [21] in studies where genotype data are unavailable. Given the strong SNP influence on gap signals, we hypothesized that PCs derived from gap signals could be utilized in a similar manner to the Barfield method. Similar to Barfield et al., our gap signal based PCs we able to clearly separate ancestry groups (Fig. 9). This result is expected since most (~85%) gap signals can be attributed to an empirical or reference database annotated SNP/variant, most of which are present in the 1000 Genomes Project that was used by Barfield et al. Additionally, most of the 96 probes identified by Daca-Roszak et al. [17] because they differentiated two ancestral groups exhibit “gap-like” distributions.
Gap signals are enriched in common EWA probe filtering strategies
One approach used in EWA studies to address multiple testing burden is to subset the dataset to only those probes that are variably methylated. We sought to define the proportion of probes in the post-variably methylated filtering dataset that had gap signals identified using gap hunting. As expected, due to our gap signals having inherently high variability, we observed gap signal probe enrichment in the filtered dataset as we increased our standard deviation threshold for filtering (Additional file 11: Figure S34). Enrichment was consistent at various percentile cutoffs of standard deviation across samples. This result emphasizes that researchers should be aware that applying filtering criteria related to probe variability can increase the proportion of gap signals.
Common EWA probe filtering strategies that remove all SNP-associated probes may miss disease-relevant loci
Currently, most EWA studies explicitly remove polymorphism-affected probes that are a priori defined using a reference SNP database or in the Illumina manifest, prior to association analyses. However, based on our findings there are two main concerns with this removal approach. First, it is possible that the SNPs present in reference databases, gathered from many ancestral populations and often includes rare SNPs, may not reflect the genetic architecture among the samples examined in a particular EWA. Second, we have shown that gap signals can be influenced by SNPs and, therefore, gap signals may represent the local genetic structure underlying the interrogated CpG site; thus, they could still be biologically relevant to the outcome of interest, but should be interpreted with caution. This local genetic structure extends beyond that of the interrogated CpG site and 50-mer probe and includes the entire haplotype on which the CpG site exists. For example, cg12162195 exhibits a three-group gap signal, with three SNPs annotated in the probe body (Fig. 10). The samples in each group represent distinct groups of haplotypes; therefore, these methylation groups serve as a surrogate for their respective collections of haplotypes. This is true for all 450k probes that are not located within recombination hotspots. Given our findings that many gap signals have a strong genetic basis underlying the observed differences in methylation, we would expect methylation values at gap probes will capture a larger degree of haplotype diversity than non-gap probes. Therefore, we propose that instead of removing reference database SNP or gap hunting-defined gap signal probes before association analyses, they be included, but flagged, and carefully investigated and interpreted after analyses should they be associated with the outcome of interest.
To examine the difference between our “flag”-based gap hunting approach versus a “remove” reference SNP annotation-based approach in downstream interpretation of EWAS, we ran our EWAS pipeline on publicly available data to evaluate the relationship between placenta methylation and newborn neurobehavior. We identified a total of 11,286 gap signals among 443,825 probes. Using our EWAS pipeline, 56 probes showed suggestive statistical significance (p < 1E − 4) for infant arousal. Of these significant probes, 5 were gap signals (Fig. 11), 15 were annotated as SNP-affected, and 3 of these were both SNP-annotated and gap identified. Thus, an analysis with gap hunting results in 56 hits, 5 of which are flagged as gaps for further investigation and interpretation. Using SNP annotation filtering without gap hunting resulted in 41 hits, 2 of which have suspicious gap-like distributions. Inclusion of all probes in EWAS, with flags for gaps via gap hunting and for SNP annotation, rather than explicitly filtering probes, allows broader consideration of biologically relevant associations and user-specific choices regarding interpretation, rather than omitting potentially relevant findings.