Additional annotation enhances potential for biologically-relevant analysis of the Illumina Infinium HumanMethylation450 BeadChip array
© Price et al.; licensee BioMed Central Ltd. 2013
Received: 17 November 2012
Accepted: 13 February 2013
Published: 3 March 2013
Measurement of genome-wide DNA methylation (DNAm) has become an important avenue for investigating potential physiologically-relevant epigenetic changes. Illumina Infinium (Illumina, San Diego, CA, USA) is a commercially available microarray suite used to measure DNAm at many sites throughout the genome. However, it has been suggested that a subset of array probes may give misleading results due to issues related to probe design. To facilitate biologically significant data interpretation, we set out to enhance probe annotation of the newest Infinium array, the HumanMethylation450 BeadChip (450 k), with >485,000 probes covering 99% of Reference Sequence (RefSeq) genes (National Center for Biotechnology Information (NCBI), Bethesda, MD, USA). Annotation that was added or expanded on includes: 1) documented SNPs in the probe target, 2) probe binding specificity, 3) CpG classification of target sites and 4) gene feature classification of target sites.
Probes with documented SNPs at the target CpG (4.3% of probes) were associated with increased within-tissue variation in DNAm. An example of a probe with a SNP at the target CpG demonstrated how sample genotype can confound the measurement of DNAm. Additionally, 8.6% of probes mapped to multiple locations in silico. Measurements from these non-specific probes likely represent a combination of DNAm from multiple genomic sites. The expanded biological annotation demonstrated that based on DNAm, grouping probes by an alternative high-density and intermediate-density CpG island classification provided a distinctive pattern of DNAm. Finally, variable enrichment for differentially methylated probes was noted across CpG classes and gene feature groups, dependant on the tissues that were compared.
DNAm arrays offer a high-throughput approach for which careful consideration of probe content should be utilized to better understand the biological processes affected. Probes containing SNPs and non-specific probes may affect the assessment of DNAm using the 450 k array. Additionally, probe classification by CpG enrichment classes and to a lesser extent gene feature groups resulted in distinct patterns of DNAm. Thus, we recommend that compromised probes be removed from analyses and that the genomic context of DNAm is considered in studies deciphering the biological meaning of Illumina 450 k array data.
KeywordsInfinium HumanMethylation450 BeadChip array DNA methylation non-specific probes Polymorphic probes CpG islands Annotation CpG enrichment Tissue-specific DNA methylation Repetitive elements 450 k
Measuring epigenetic marks has become an attractive approach for connecting phenotype, genetics and environment in many fields of medicine [1–3]. DNA methylation (DNAm), the addition of a methyl group primarily to cytosines in the context of CpG dinucleotides, is one such highly studied epigenetic mark. Epigenome-wide association studies (EWAS) of DNAm have been proposed as a complement to genome-wide association studies (GWAS) for elucidating loci correlated with complex disease . Although these large-scale association studies provide a great amount of information, there are currently limits to our ability to interpret this data  given the variability of DNAm across individuals, ethnicities, sex, age, tissue type and environment [6, 7]. To improve the analysis potential of a popular tool for large-scale measurement of DNAm, the Infinium HumanMethylation450 BeadChip (450 k) (Illumina, San Diego, CA, USA), we have annotated technically unreliable probes and enhanced the biological annotation of this DNAm microarray.
The 450 k array combines two technically distinct assays in one platform: the Infinium I assay (type I probes) and Infinium II assay (type II probes) (see methods section for details). The design and specifications of the 450 k array have been discussed in other publications [8–10], and extensive probe annotation is available from Illumina to aid users in data interpretation. This annotation includes, for example, probe location within genes (annotated by University of California, Santa Cruz (UCSC) Genome Browser (http://genome.ucsc.edu; UCSC Genome Bioinformatics, Santa Cruz, CA, USA), CpG islands and shores, and regulatory features. However, recently technical limitations of the Infinium platform have been described [11, 12]. In 2011, an evaluation of an earlier version of the array, the Infinium HumanMethylation27k BeadChip (27 k) that used exclusively the Infinium I assay, identified two groups of probes as possibly compromised by their design . The first group, accounting for about 6 to 10% of the 27 k array, was non-specific probes, that is, probes that hybridized to multiple genomic locations in silico. The level of DNAm at non-specific probes likely reflects a combination of DNAm at the various locations to which these probes hybridize. The second group of unreliable probes was those with a polymorphic target (0.24% of 27 k probes). Since the Infinium DNAm platform uses quantitative genotyping of C/T SNPs introduced following bisulfite conversion to determine the level of DNAm, probes with polymorphisms at the target C or G have the potential of assessing a difference in genotype rather than a true difference in DNAm. A corresponding increase in the number of both non-specific probes and polymorphic probes is expected given the similar technology of the 450 k array .
CpG dinucleotides are not randomly distributed throughout the genome, most have spontaneously deaminated with the exception of some CpG-enriched regions known as ‘CpG islands’ . About 70% of gene promoters are associated with CpG islands  and traditionally gene transcription has been thought to be repressed by the presence of promoter CpG island DNAm [15, 16]. There are different approaches for classifying CpG enrichment, for example, UCSC defines CpG islands based on CG content >50%, Observed/Expected (Obs/Exp) CpG ratio >0.6 and length >200 bps . An alternative classification of CpG islands providing more enrichment discrimination is high-density CpG islands (HCs, CG content >55%, Obs/Exp CpG ratio >0.75 and length >500 bps), intermediate-density CpG islands (ICs, CG content >50%, Obs/Exp CpG ratio >0.48 and length >200 bps) and non-islands (LCs or low-density CpG regions, non-HC/IC regions) [16, 18]. However, the most biologically meaningful definition of CpG enrichment remains to be determined.
In the past, many DNAm studies focused on promoters and CpG islands however, recently attention has also turned towards the study of DNAm patterns in the regions surrounding islands, known as shores. CpG island shores have been observed to be variably methylated between unrelated individuals, in cancer and in iPS cell lines [19–21]. The level of DNAm in shores may in fact be more highly correlated with gene expression than that of CpG islands . Furthermore, tissue-specific gene expression has been associated with tissue differences in DNAm at shores , perhaps as a consequence of transcription machinery binding to nearby promoter CpG islands. Others have shown that DNAm outside of CpG islands and shores may also be associated with gene expression. For example, one study identified lower levels of gene body DNAm associated with the lowest and highest levels of gene expression, whereas higher levels of gene body DNAm were associated with intermediate levels of gene expression .
While the 450 k array offers the opportunity to examine DNAm at individual CpGs across CpG island and non-island regions, the inclusion of this diverse range of sites requires a more complex and detailed annotation of the array. To enhance the utility of the 450 k array, we increased probe annotation in four areas: 1) documented SNPs in the target CpG, 2) probe binding specificity, 3) CpG classification of target sites and 4) gene feature classification of target sites. We tested the expanded annotation in a set of control samples of interest to our investigations: adult blood (n = 4), child buccal (n = 4) and placental chorionic villi (n = 4), and followed up some analyses in a larger, publically available blood dataset (n = 261). In particular, we evaluated DNAm patterns relative to functional aspects of probe location, while considering the effects of technically biased probes. Based on our analyses, we recommend that users analyze 450 k data with the following factors in mind: 1) DNAm measured at probes with SNPs in the target site may be compromised by sample genotype, 2) DNAm measured at non-specific probes may not only represent the intended site of hybridization and 3) DNAm varies across CpG enrichment classes as well as gene features.
Results and discussion
Polymorphic CpGs may affect the assessment of DNAm
Infinium assays are based on quantifying bisulfite-introduced C/T SNPs, thus the actual DNA sequence at the target CpG is at risk of compromising the assessment of DNAm. The end of each 450 k probe targets a CpG of interest and although the alignment of type I and type II probes with CpGs differs by one base pair (Additional file 1), end nucleotide match is essential for extension of both probe types. A SNP leading to a sequence change at a target CpG might result in a false DNAm signal due to hybridization of the wrong probe (possible for type I probes) or no/minimal extension at the target site (possible for both probe types). Illumina included annotation of SNPs located within 10 bps of the target CpG (SNP <10 bp, n = 36,535 probes) and those located within the remainder of the probe (SNP >10 bp, n = 59,892 probes). We have added annotation of probes that query CpGs with documented polymorphisms specifically at the C and/or G position (target CpG SNPs).
Theoretically, a bi- or tri-modal distribution of DNAm would be produced by a probe affected by sample genotypes at a target CpG SNP and this pattern would result in a high within-tissue SD in ß value (450 k array measure of DNAm ranging from 0 to 1). Thus, we examined the distribution of within-tissue SD in ß (n = 4/tissue) at probes annotated with a target CpG SNP, SNP <10 bp (excluding those probes also annotated with a target CpG SNP) and SNP >10 bp (Figure 1B, results for blood). The distribution of SD in ß for probes annotated with a target CpG SNP was most different (p= 1.78 × 19-15) from that of all probes based on a Kolmogorov-Smirnov (KS) test for difference in distribution. This trend was illustrated by a shift in the density curve for SD in ß of probes with target CpG SNPs in comparison to the curve for SD in ß of all probes (Figure 1B). To ensure that this finding was not an artifact of our small sample size, we performed the same analysis using a larger, publically available dataset, Gene Expression Omnibus (GEO) [GSE:40279], that had investigated age-associated DNAm changes in the blood of 656 individuals (aged 19 to 101 years) . We extracted the younger half of samples (n = 261, aged 19 to 61 years) for our analysis since this roughly covered the age of the blood samples in our study. In this larger dataset (referred to as the ‘aging dataset’), the distribution of SD in ß for probes annotated with a target CpG SNP also exhibited the largest difference in distribution from that of all probes (p = 1.22 × 10-14) based on a KS test (Figure 1C).
The majority of highly variable probes were annotated with SNPs
Highly variable probesa
Annotated with target CpG SNP
To confirm that a target CpG SNP could affect DNAm, samples were genotyped at a probe (cg06961873) that had an annotated target CpG SNP and SD in ß ≥0.25 in all three tissues. As predicted, homozygous C samples were assessed as hypermethylated, heterozygotes were assessed as approximately 50% methylated and homozygous T samples as hypomethylated (Figure 1D). Although we were not able to genotype samples, a histogram of DNAm at this same CpG site across the 261 aging dataset samples showed the same trimodal pattern of DNAm (Additional file 3). Other examples of highly variable probes in the aging dataset also illustrate this pattern (Additional file 3).
Given the demonstrated potential to bias the call of DNAm, we suggest that probes with a target CpG SNP should be disregarded in most analyses of the 450 k array. At minimum, 450 k users should carefully check candidate probes against the target CpG SNP annotation in addition to a current SNP database, as more polymorphisms are likely to be identified and validated in coming years. Although we have used a straightforward example to illustrate how a target CpG SNP may confound the assessment of DNAm, effects may also be observed at SNPs within the remainder of the probe, that is, outside of the target CpG. For example, polymorphisms throughout the interval of hybridization have been shown to affect the binding of probes used in Illumina mRNA expression arrays , which have the same probe lengths as the 450 k array. Similar effects have also been observed in Affymetrix mRNA expression arrays (Affymetrix, Santa Clara, CA, USA), although these use shorter probes that might be more sensitive to sequence mismatches . Additionally, several studies have recognized the heritability of DNAm through the genetic-epigenetic interaction of methylation-associated SNPs (mSNPs) [27, 28], suggesting that some SNP-associated differences in DNAm may be true differences and not due to technical artifacts.
8 to 9% of probes mapped to more than one location in silico
Location of in silico cross-hybridization of non-specific probes
Intended probe target: auto chrs
Intended probe target: sex chrs
Total on array
Cross-hybridize only to auto chrs
Cross-hybridize only to sex chrs
Cross-hybridize to auto and sex chrs
Total: cross-hybridize to sex chrs
Total: cross-hybridize to auto chrs
In the aging dataset, after excluding sex chromosome probes, but not our annotated non-specific probes, we used a false discovery rate (FDR) and minimum difference in DNAm (Δß) between sexes to identify autosomal probes that were differentially methylated between males (n = 133) and females (n = 128). An FDR of <1% and minimum Δß of 10% identified 75 sex-specific autosomal probes of which 40% were annotated to cross-hybridize to the sex chromosomes (Additional file 4). Although some true sex differences in DNAm likely exist on the autosomes, this result indicates that many of the large autosomal sex differences in DNAm may be an artifact of probe design and likely actually represent sex-chromosome differences in DNAm. Depending on the research question, some investigators may choose to exclude all or a subset of non-specific probes prior to data analysis, while others may include them and follow-up candidate probe specificity before establishing conclusions.
Homologous gene families, duplicated genes or repetitive elements have been proposed as potential causes of in silico cross-hybridization of Infinium probes . Thus, for all 450 k probes, we annotated the number of nucleotides at the intended site of hybridization that mapped to repetitive DNA based on RepeatMasker (http://www.repeatmasker.org; RepeatMasker, Institute for Systems Biology, Seattle, WA, USA) annotation in BLAT . For 72,957 probes (15.0% of 450 k probes), more than half of the nucleotides in the probe (>25 bps) was targeted to repetitive DNA. We had annotated 19,731 of these repetitive probes as non-specific, which reflects their in silico cross-hybridization. Interestingly, for 24,847 specific probes (that is, mapped only to the intended target), the entire probe (50 bps) was in repetitive DNA. This group of specific repetitive probes might be exploited to assess DNAm of repetitive elements; of interest to studies investigating changes in DNAm in cancer or in association with environmental exposure [31, 32].
Comparing Illumina and HIL annotation of probes highlighted differences between CpG classification systems
As previously mentioned, the 450 k array includes probes designed to target UCSC CpG islands, as well as shores, shelves and non-island regions, which we refer to as the ‘sea’  (Additional file 5A, see methods for class definitions). Alternative ‘HIL’ CpG classes (that is, high-density CpG island (HC), intermediate-density CpG island (IC) and non-island (LC)) provide a different criterion for probe annotation based on CpG enrichment. We expanded the 450 k annotation by categorizing probes into four HIL classes: 1) HC probes, 2) IC probes, 3) ICshore probes (regions of intermediate-density CpG island that border HCs) and 4) LC probes (Additional file 5B, see methods for class definitions) [16, 18].
To elucidate potential functional differences between CpG classes, we examined the distribution of DNAm for both Illumina and HIL-annotated CpG classes (for blood, Additional file 7; buccal, Additional file 8 and chorionic villi, Additional file 9). Within each classification system, all distribution curves were significantly different from each other. On average, KS statistics were larger for comparisons between HIL CpG classes than for Illumina CpG classes (Additional files 7, 8, 9), indicative of more distinct distributions of DNAm in HIL CpG classes.
A goal of the additional CpG classification of 450 k probes was to identify biologically-relevant structures that might underlie genome-wide changes in DNAm. The HIL CpG classes demonstrated a more extreme DNAm profile and larger percentage of tDM probes which may be reflective of biological processes. Intriguingly, even though ICs and ICshores have the same CpG density, distinct differences between these two classes emerged in our analyses, suggesting that ICs that border HCs are distinct from ICs on their own, highlighting the utility of this additional classification.
DNAm was variable across nine gene feature groups
We were also interested in assessing where tissue-specific differences in DNAm occurred based on gene features. Thus, we examined tDM probes for enrichment within each gene feature group (again separated by CpG class, Additional file 13). tDM probes in first exons were significantly depleted in 5’UTRs located in HCs and ICshores, but significantly enriched in 5’UTRs located in LCs. HC exons were significantly enriched for tDM probes in 5’UTR, body and 3’UTR across all tissue comparisons, perhaps due to biological significance or small probe numbers in these categories. Although CpG classes were primarily associated with differences in DNAm, gene structure is also an important factor to consider when analyzing 450 k array results.
With the advent of next-generation sequencing applied to bisulfite converted samples, measurement of DNAm will truly be possible on a genome-wide, sequence-specific scale. However, difficulties currently lie in the alignment of reduced complexity reads as well as biologically-relevant interpretation of data . Array-based technologies, which target specific genomic regions of interest, are of value for assessing physiologically-relevant changes in studies of human health and disease. Detailed and comprehensive annotation of locus-specific arrays is paramount to successful analysis and interpretation of data.
In this article, we presented an expanded annotation of the 450 k array including both compromised probe annotation and additional biologically-relevant annotation. Our expanded annotation has been deposited as a platform on the NCBI GEO (http://www.ncbi.nlm.nih.gov/geo) under the accession [GSE:42409]. Based on our findings, we suggest that all 450 k users analyze data with the following factors in mind. Probe signals may be biased by the presence of SNPs in the target CpG and/or binding of probes to multiple genomic locations. SNPs at the target CpG may be especially problematic in studies with small sample size, as chance may result in dramatic differences in the frequency of polymorphisms between groups. However, false positives may still result in studies with larger sample sizes, if groups are not ethnically matched. Additionally, DNAm patterns within CpG enrichment classes or gene features could overshadow findings between study groups if probes are not separated and considered within these genomic features.
There are certainly other methods and filters that can be applied to 450 k array analyses that were not touched upon in this article. Notably, a recent study excluded 450 k probes that mapped to copy-number variations (CNVs) because of the potential to bias measurement of DNAm . Furthermore these authors set a criterion of ‘comethylation’ to identify differentially methylated regions, that is, all probes within a 250 bp window had to show the same trend in DNAm. As more studies using the 450 k array are published, we will undoubtedly see various combinations of applied filters and methodological practices for data analysis. In this era of extensive data collection using such high-throughput assays, it is vital that the type of biological sample as well as the research question is carefully considered in relation to downstream analytical choices as well as the technical platform.
To complete the expanded annotation, we calculated additional probe location information based on the Illumina-provided MAPINFO GenomeStudio column (location of the C in the target CpG): 1) the interval of the target CpG (CpG), 2) the interval containing the probe but excluding the target CpG (Probew/o CpG) and 3) the interval of the entire probe (entire probe) (Additional files 1 and 14). Probe type (type I vs type II) and strand of design (F or R) were taken into consideration when calculating genomic location. Ten type I and ten type II probes were manually checked against the annotated probe sequence. A UCSC track was created containing the targeted Cs on the 450 k array (Additional file 15). All of the annotation and analysis of the expanded annotation was conducted on 485,512 probes, including both cg (CpG loci) and ch (non-CpG loci) probes but excluding rs (SNP assay) probes, unless otherwise specified.
The dbSNP131 table was imported into Galaxy (http://galaxyproject.org; Galaxy, Pennsylvania State University, PA, USA) from UCSC . Only rs numbers for SNPs that were an interval of 1 bp in length and of the highest quality (weight = 1) were included in the annotation. An interval file was uploaded into Galaxy using the hg19 location we annotated for the interval of each probe spanning the C and G of the target CpG for cg probes only. The probe file was intersected with the dbSNP131 table to create a list of probes with documented SNPs in the C or G of the target CpG (target CpG SNP). This file was collapsed in R (http://www.r-project.org; R Foundation for Statistical Computing, Vienna University of Economics and Business, Vienna, Austria) to create a list of rs numbers for each probe, since some target CpGs were documented with more than one SNP. The rs numbers for SNPs in the target CpG were included in the expanded annotation in the ‘target CpG SNP’ column (n = 20,270), while the number of SNPs/probes was annotated in the ‘n_target CpG SNP’ column.
Non-specific probe annotation
To identify probes that potentially have multiple genomic targets (non-specific probes), we followed the method described by Chen et al.. Special treatment of type II probes was required as the Illumina annotation has noted Cs in CpGs within the probe as an ‘R’ SNP. For type II probes that contained Rs we considered two probe sequence versions, one with all Rs replaced by As and the other with all Rs replaced by Gs. Using these conditions, we matched each of the 450 k probes with the Illumina-annotated genomic location (intended target).
Briefly, we used BLAT  to align probe sequences to four versions of the hg19 draft sequence genome: 1) a fully unmethylated ‘bisulfite treated’ genome, with all Cs converted to Ts; 2) a fully methylated ‘bisulfite treated’ genome, with only non-CpG Cs converted to Ts; 3) and 4) were the above treatments on the reverse complement sequence. BLAT was run using the following parameters: stepSize = 5, wordsize = 11 and repMatch = 1,000,000; lowering the word length led to only fractionally more hits. The selection criterion used was as previously outlined: for a probe to be considered non-specific, there had to be 90% identity over the aligned region, at least 40 of 50 matching bps, no gaps, and the 50th nucleotide had to align, as the probe hybridizes to the target CpG at this position . The number of non-specific probes hits were annotated in the expanded annotation ‘AlleleA_Hits, AlleleB_Hits’ columns, while the site of cross-hybridization was annotated in the columns ‘XY_Hits’ (if at least one hit was on a sex chromosome) and ‘Autosomal_Hits’ (if at least one hit was on an autosomal chromosome). Repetitive sequences from RepeatMasker were marked in lowercase in the four genomes. Thus we identified the amount of repetitive DNA within the Illumina-intended alignment of each probe in the expanded annotation column ‘n_bp_repetitive’.
CpG enrichment annotation
Illumina categorized probes in CpG islands (GenomeStudio column ‘Relation_to_UCSC_CpG_Island’) based on the UCSC Genome Browser criteria of CG content >50%, Obs/Exp CpG ratio >0.60 and length >200 bps. Shores and shelves were identified based on their relationship to a CpG island; shores as the 2 kbs up- and down-stream of CpG islands and shelves as the 2 kbs outside of shores. The remaining probes were located in non-island regions, which we refer to as the ‘sea’  (Additional file 5A).
We annotated probes into four HIL CpG classes based on alternative CpG enrichment criteria: high-density CpG island probes (HC, n = 153,859), intermediate-density CpG island probes (IC, n = 118,727), ICshore probes (probes in ICs that border HCs, n = 33,955) and non-island probes (LC, n = 178,971) (Additional file 5B). This annotation has been added in the ‘HIL_CpG_class’ column of the expanded annotation. To locate probes within each of the four CpG classes, we first annotated these CpG enrichment classes throughout the genome. The hg19 genomic sequence was downloaded from UCSC in overlapping segments and read by CpGIE, a Java software program . CpGIE searches input sequences in sliding windows based on user-set criteria. HCs were defined as regions with CG content >55%, Obs/Exp CpG ratio >0.75 and length >500 bps, while ICs were defined as regions with CG content >50%, Obs/Exp CpG ratio >0.48 and length >200 bps [16, 18]. CpGIE HC and IC output was merged into a single file for each chromosome, duplicate islands were removed and CpG islands were identified as follows: ICs, isolated regions of the genome with IC density; ICshores, regions of the genome with IC density that were next to regions with HC density; HCs, any region of the genome with HC density; and LCs, regions that were not of IC or HC density. Islands were given unique names in the annotation, for example, chr8_IC:49890018–49891221 (chr#_CpG class: genomic start–genomic end). The hg19 HC and IC islands have been complied into a UCSC track available in Additional file 16. The hg19 HIL annotation was intersected with the genomic location (hg19) of 450 k targets in Galaxy to assign probes into the four CpG classes. An annotation of probes into HIL CpG islands using the detailed nomenclature can be found in the expanded annotation column ‘HIL_CpG_Island_Name’.
Gene feature and TSS annotation
Using the NCBI Reference Sequence (RefSeq) gene annotation, we annotated probes into nine groups based on three gene components (first exons, exons and introns) and three gene regions (5’UTR, body and 3’UTR). Probes were grouped into: 1) 5’UTR first exons, 2) 5’UTR exons, 3) 5’UTR introns, 4) body first exons, 5) body exons, 6) body introns, 7) 3’UTR first exons, 8) 3’UTR exons and 9) 3’UTR introns (Figure 5). Briefly, the hg19 RefSeq table was downloaded from UCSC . Exon and intron information was extracted and parsed into genomic interval data with the most upstream exon denoted as the first exon. Next, 5’UTR, gene body and 3’UTR location was parsed into genomic interval data utilizing the transcription start/stop and coding start/stop information from RefSeq. Intersection was performed between each of 5’UTR, gene body and 3’UTR with first exon, exon and intron intervals to generate the nine gene features. The gene feature intervals were then intersected with the hg19 genomic location of 450 k targets in R to assign probes into the nine gene features. This annotation was completed using both RefSeq gene names and transcript names. Gene feature annotation was conducted using the GenomicRanges package in R .
The hg19 UCSC knownGene table  was downloaded to Galaxy and the closest TSS for each probe was annotated, regardless of whether the probe was located within the same gene. For each probe, the distance to the closest TSS, gene name and transcript name was noted in the expanded annotation columns ‘Closest_TSS’, ‘Distance_closest_TSS’, ‘Closest_TSS_gene_name’ and ‘Closest_TSS_Transcript’.
Two male and two female chorionic villus samples were collected through the BC Women’s Hospital & Health Centre, Vancouver, BC, Canada, as controls for a study of chromosomal abnormalities in the placenta. DNA was extracted from a small piece of chorionic villi as previously described . For each placental sample (n = 4), DNA from two independent chorionic villi was combined in equal amounts prior to bisulfite conversion to ensure a representative sample of the placenta. DNA was extracted by standard salt method. Two male and two female blood samples were collected as adult controls for ongoing studies of respiratory disease and epigenetics (n = 4). Peripheral blood mononuclear cell (PBMC) DNA was extracted according to standard procedures. Buccal epithelial samples were collected from two males and two females for a study on maternal care effects on childhood DNAm (n = 4). Buccal samples were collected using Isohelix DNA Buccal Swabs (Cell Projects Ltd, Harrietsham, Kent, UK), and stabilization reagents and DNA were extracted using Isohelix DNA Isolation Kits (Cell Projects Ltd) as per the manufacturer’s protocols.
Illumina 450 k array
Two ug of genomic DNA was purified using the DNeasy Blood & Tissue Kit (Qiagen, Valencia, CA, USA) following the manufacturer’s protocol. Purified DNA quality and concentration were assessed with a NanoDrop ND-1000 (Thermo Scientific, Waltham, MA, USA) prior to bisulfite conversion. One ug of purified genomic DNA was bisulfite converted using the EZ DNA Methylation Kit (Zymo Research, Orange, CA, USA) following the manufacturer’s protocol. Bisulfite DNA quality and concentration were assessed using the NanoDrop and, if required, samples were concentrated to approximately 50 ng/ul using a SpeedVac (Thermo Electron Corporation, Waltham, MA, USA). Following the Illumina 450 k array protocol, 4 ul of bisulfite converted sample was whole-genome amplified, enzymatically digested, hybridized to the array and then single nucleotide extension was performed .
Two assay types are used by the 450 k array to measure DNAm: Infinium I (type I probes) and Infinium II (type II probes), bound to beads scattered throughout the array. When a probe successfully binds to DNA, a single fluorescently labeled nucleotide extends off the probe and this signal is read by an Illumina scanner. The Infinium I assay uses two bead types specific to the CpG of interest: an unmethylated (u) and a methylated (m) bead, each with a different probe design (ProbeA (u) and ProbeB (m)). Both type I probes for a given CpG fluoresce in the same color channel (either red (Cy5) or green (Cy3)). The Infinium II assay uses only one bead type for each CpG of interest, an m + u bead. One probe is designed for each type II target site and the color of fluorescence is based on which nucleotide is incorporated in the single base extension step. The incorporation of an A or T signals an unmethylated site in red (u) and the incorporation of a C or a G signals a methylated site in green (m) .
Chips were scanned using an Illumina HiScan on a two-color channel to detect Cy3 labeled probes on the green channel and Cy5 labeled probe on the red channel. Illumina GenomeStudio Software 2011.1 was used to read the array output and conduct background normalization. The signalA, signalB and probe intensity were exported for autosomal probes and read into R. M values were generated using the Bioconductor (http://www.bioconductor.org; Fred Hutchinson Cancer Research Center, Seattle, WA, USA) methylumi package, M = log2(intensity m + 1/intensity u + 1) since this value has been shown to be valid for statistical analyses . Following correction for chip to chip color bias using the Bioconductor lumi package  and probe type correction using subset-quantile within array normalization (SWAN) , M values were converted to ß values using the equation ß = (2M/(2M + 1)). The ß value is a number ranging from 0 to 1 that is directly proportional to percentage DNAm; thus to ease interpretation, we have reported results as ß values. The microarray data used in this article was submitted to the NCBI GEO under accession number [GSE:42409]. Probes with a detection p value >0.01 in any sample, probes with no ß value in any sample, all rs and ch probes, all sex chromosome and non-specific probes were removed prior to analyses. The level of DNAm for 428,216 probes in our sample dataset was intersected with the expanded annotation for further analyses.
Processing of aging dataset
Series matrix files were downloaded for [GSE:40279] containing ß values for 473,039 probes per sample . We worked with the subset of samples that roughly matched the age of the samples used in our study (n = 261, aged 19 to 61 years). Probes with no ß value in any sample, all sex chromosome probes, all rs and all ch probes were removed from the dataset. For SNP analyses, non-specific probes were also removed, however these were retained in the analysis of autosomal sex-specific probes. For the discovery of autosomal probes with sex differences in DNAm, ß values were read into R, converted into M values using the Bioconductor package lumi  and then significance analysis of microarrays (SAM) was conducted using the Bioconductor package siggenes . At FDR <1%, 10,139 autosomal probes were identified as significantly different between male and female samples. Next, this list was crossed with a list of Δß values for each probe calculated by taking the absolute value of the difference between average ß of males and average ß of females.
Probe cg06961873 was selected for genotype validation of SNP rs61775206 in each sample. Primers were designed using PSQ Assay Design software version 1.0.6 (Biotage AB, Uppsala, Sweden). Primer sequences and probe information are available in Additional file 17. Using the following conditions, 0.5 ul of genomic DNA was PCR-amplified: 95°C for 5 minutes, (95°C for 20 seconds, 55°C for 20 seconds, 75°C for 20 seconds) × 50, 72°C for 5 minutes. Genotyping was performed using a PyroMark MD system (Biotage AB) and analyzed with PSQ 96MA SNP software (Biotage AB).
A KS test was used to assess the difference in distribution of SD in ß values for probes that contained SNPs. The KS statistic represents the maximum absolute difference between the cumulative distributions of two functions. Probes with small within-tissue SD in ß (<0.10) were removed from all probe groups to increase the power of the analysis. Probes with a target CpG SNP were removed from the SNP <10 bp group. The number of probes included in the SD in ß distribution curves for blood samples was 5,450 for all probes, 809 for SNP >10 bp, 402 for SNP <10 bp and 2,190 for target CpG SNP, and for the aging dataset was 6,267 for all probes, 1,022 for SNP >10 bp, 362 for SNP <10 bp and 2,753 for target CpG SNP. KS tests were also used to assess the difference in distribution of DNAm between Illumina CpG classes and between HIL CpG classes. Fisher’s exact test was used to compare the distribution of the number of probes within the three levels of DNAm for both Illumina and HIL-annotated CpG classes: hypomethylated (ß values of 0 to ≤0.2), heterogeneously methylated (ß values of >0.2 to <0.8) and hypermethylated (ß values of ≥0.8 to 1.0). Enrichment analyses of tDM probes were performed in Python (Python Software Foundation). To select tDM probes, DNAm was first averaged for each probe within a tissue. A z-score was calculated for each probe comparison between tissues. A p value cutoff of 0.05 was selected with a Bonferroni correction to account for repeated comparisons . KS and Fisher’s exact tests were performed in R. Statistical significance was considered as tests with p values <1.0 × 10-7. All figures were created in R and Adobe Illustrator CS6.
- 27 k:
Infinium HumanMethylation27k BeadChip
3’ untranslated region
- 450 k:
Infinium HumanMethylation450 BeadChip
5’ untranslated region
database of single nucleotide polymorphisms
epigenome-wide association studies
false discovery rate
Gene Expression Omnibus
genome-wide association studies
high-density CpG island
high-density CpG island (HC), intermediate-density CpG island (IC) and non-island (LC) or low-density regions
intermediate-density CpG island
intermediate-density CpG island shore
National Center for Biotechnology Information
peripheral blood mononuclear cell
polymerase chain reaction
significance analysis of microarrays
single nucleotide polymorphism
subset-quantile within array normalization
differentially methylated between tissues
transcription start site
University of California, Santa Cruz.
The authors would like to thank Courtney Hanna, Dr Kirsten Hogg, Samantha Peeters, Dr Maria Peñaherrera and John Blair for suggestions and editorial assistance, and Sarah Neumann for aid in array processing. MSK is a Fellow of the Canadian Institute for Advanced Research (CIFAR), Experience-Based Brain and Biological Development program, and a Scholar of the Djavad Mowafaghian Foundation. EE is also a Fellow of the CIFAR. This research was supported by grants to WPR from the Canadian Institutes for Health Research (CIHR, grant MOP-106430), to MSK from the NeuroDevNet, a Canadian Network of Centres of Excellence, and to CJB from the CIHR (grant MOP-13690). EE’s laboratory is supported by a grant from the National Sciences and Engineering Council of Canada (NSERC).
- Dempster EL, Pidsley R, Schalkwyk LC, Owens S, Georgiades A, Kane F, Kalidindi S, Picchioni M, Kravariti E, Toulopoulou T, Murray RM, Mill J: Disease-associated epigenetic changes in monozygotic twins discordant for schizophrenia and bipolar disorder. Hum Mol Genet. 2011, 20 (24): 4786-4796. 10.1093/hmg/ddr416.PubMed CentralView ArticlePubMed
- Hansen KD, Timp W, Bravo HC, Sabunciyan S, Langmead B, McDonald OG, Wen B, Wu H, Liu Y, Diep D, Briem E, Zhang K, Irizarry RA, Feinberg AP: Increased methylation variation in epigenetic domains across cancer types. Nat Genet. 2011, 43 (8): 768-775. 10.1038/ng.865.PubMed CentralView ArticlePubMed
- Aston KI, Punj V, Liu L, Carrell DT: Genome-wide sperm deoxyribonucleic acid methylation is altered in some men with abnormal chromatin packaging or poor in vitro fertilization embryogenesis. Fertil Steril. 2012, 97 (2): 285-292. 10.1016/j.fertnstert.2011.11.008.View ArticlePubMed
- Rakyan VK, Down TA, Balding DJ, Beck S: Epigenome-wide association studies for common human diseases. Nat Rev Genet. 2011, 12 (8): 529-541. 10.1038/nrg3000.PubMed CentralView ArticlePubMed
- Heijmans BT, Mill J: Commentary: The seven plagues of epigenetic epidemiology. Int J Epidemiol. 2012, 41 (1): 74-78. 10.1093/ije/dyr225.PubMed CentralView ArticlePubMed
- Foley DL, Craig JM, Morley R, Olsson CA, Dwyer T, Smith K, Saffery R: Prospects for epigenetic epidemiology. Am J Epidemiol. 2009, 169 (4): 389-400.PubMed CentralView ArticlePubMed
- Lam LL, Emberly E, Fraser HB, Neumann SM, Chen E, Miller GE, Kobor MS: Factors underlying variable DNA methylation in a human community cohort. Proc Natl Acad Sci U S A. 2012, 109 (Suppl 2): 17253-17260.PubMed CentralView ArticlePubMed
- Dedeurwaerder S, Defrance M, Calonne E, Denis H, Sotiriou C, Fuks F: Evaluation of the Infinium Methylation 450 K technology. Epigenomics. 2011, 3 (6): 771-784. 10.2217/epi.11.105.View ArticlePubMed
- Sandoval J, Heyn H, Moran S, Serra-Musach J, Pujana MA, Bibikova M, Esteller M: Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome. Epigenetics. 2011, 6 (6): 692-702. 10.4161/epi.6.6.16196.View ArticlePubMed
- Chen YA, Choufani S, Ferreira JC, Grafodatskaya D, Butcher DT, Weksberg R: Sequence overlap between autosomal and sex-linked probes on the Illumina HumanMethylation27 microarray. Genomics. 2011, 97 (4): 214-222. 10.1016/j.ygeno.2010.12.004.View ArticlePubMed
- Zhang X, Mu W, Zhang W: On the analysis of the illumina 450 k array data: probes ambiguously mapped to the human genome. Front Genet. 2012, 3: 73.PubMed CentralPubMed
- Morris T, Lowe R: Report on the Infinium 450 k methylation array analysis workshop: April 20, 2012 UCL, London, UK. Epigenetics. 2012, 7 (8): 961-962. 10.4161/epi.20941.PubMed CentralView ArticlePubMed
- Ioshikhes IP, Zhang MQ: Large-scale human promoter mapping using CpG islands. Nat Genet. 2000, 26 (1): 61-63. 10.1038/79189.View ArticlePubMed
- Saxonov S, Berg P, Brutlag DL: A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci U S A. 2006, 103 (5): 1412-1417. 10.1073/pnas.0510310103.PubMed CentralView ArticlePubMed
- Hsieh CL: Dependence of transcriptional repression on CpG methylation density. Mol Cell Biol. 1994, 14 (8): 5487-5494.PubMed CentralView ArticlePubMed
- Weber M, Hellmann I, Stadler MB, Ramos L, Paabo S, Rebhan M, Schubeler D: Distribution, silencing potential and evolutionary impact of promoter DNA methylation in the human genome. Nat Genet. 2007, 39 (4): 457-466. 10.1038/ng1990.View ArticlePubMed
- Gardiner-Garden M, Frommer M: CpG islands in vertebrate genomes. J Mol Biol. 1987, 196 (2): 261-282. 10.1016/0022-2836(87)90689-9.View ArticlePubMed
- Cotton AM, Lam L, Affleck JG, Wilson IM, Penaherrera MS, McFadden DE, Kobor MS, Lam WL, Robinson WP, Brown CJ: Chromosome-wide DNA methylation analysis predicts human tissue-specific X inactivation. Hum Genet. 2011, 130 (2): 187-201. 10.1007/s00439-011-1007-8.PubMed CentralView ArticlePubMed
- Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, Onyango P, Cui H, Gabo K, Rongione M, Webster M, Ji H, Potash JB, Sabunciyan S, Feinberg AP: The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat Genet. 2009, 41 (2): 178-186. 10.1038/ng.298.PubMed CentralView ArticlePubMed
- Doi A, Park IH, Wen B, Murakami P, Aryee MJ, Irizarry R, Herb B, Ladd-Acosta C, Rho J, Loewer S, Miller J, Schlaeger T, Daley GQ, Feinberg AP: Differential methylation of tissue- and cancer-specific CpG island shores distinguishes human induced pluripotent stem cells, embryonic stem cells and fibroblasts. Nat Genet. 2009, 41 (12): 1350-1353. 10.1038/ng.471.PubMed CentralView ArticlePubMed
- Zhang D, Cheng L, Badner JA, Chen C, Chen Q, Luo W, Craig DW, Redman M, Gershon ES, Liu C: Genetic control of individual differences in gene-specific methylation in human brain. Am J Hum Genet. 2010, 86 (3): 411-419. 10.1016/j.ajhg.2010.02.005.PubMed CentralView ArticlePubMed
- Ji H, Ehrlich LI, Seita J, Murakami P, Doi A, Lindau P, Lee H, Aryee MJ, Irizarry RA, Kim K, Rossi DJ, Inlay MA, Serwold T, Karsunky H, Ho L, Daley GQ, Weissman IL, Feinberg AP: Comprehensive methylome map of lineage commitment from haematopoietic progenitors. Nature. 2010, 467 (7313): 338-342. 10.1038/nature09367.PubMed CentralView ArticlePubMed
- Jjingo D, Conley AB, Yi SV, Lunyak VV, Jordan IK: On the presence and role of human gene-body DNA methylation. Oncotarget. 2012, 3 (4): 462-474.PubMed CentralView ArticlePubMed
- Hannum G, Guinney J, Zhao L, Zhang L, Hughes G, Sadda S, Klotzle B, Bibikova M, Fan JB, Gao Y, Deconde R, Chen M, Rajapakse I, Friend S, Ideker T, Zhang K: Genome-wide Methylation Profiles Reveal Quantitative Views of Human Aging Rates. Mol Cell. 2013, 49 (2): 359-367. 10.1016/j.molcel.2012.10.016.PubMed CentralView ArticlePubMed
- Barbosa-Morais NL, Dunning MJ, Samarajiwa SA, Darot JF, Ritchie ME, Lynch AG, Tavare S: A re-annotation pipeline for Illumina BeadArrays: improving the interpretation of gene expression data. Nucleic Acids Res. 2010, 38 (3): e17-10.1093/nar/gkp942.PubMed CentralView ArticlePubMed
- Benovoy D, Kwan T, Majewski J: Effect of polymorphisms within probe-target sequences on olignonucleotide microarray experiments. Nucleic Acids Res. 2008, 36 (13): 4417-4423. 10.1093/nar/gkn409.PubMed CentralView ArticlePubMed
- Fraser HB, Lam LL, Neumann SM, Kobor MS: Population-specificity of human DNA methylation. Genome Biol. 2012, 13 (2): R8-10.1186/gb-2012-13-2-r8.PubMed CentralView ArticlePubMed
- Bell JT, Pai AA, Pickrell JK, Gaffney DJ, Pique-Regi R, Degner JF, Gilad Y, Pritchard JK: DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biol. 2011, 12 (1): R10-10.1186/gb-2011-12-1-r10.PubMed CentralView ArticlePubMed
- Blair JD, Price EM: Illuminating Potential Technical Artifacts of DNA-Methylation Array Probes. Am J Hum Genet. 2012, 91 (4): 760-762. 10.1016/j.ajhg.2012.05.028.PubMed CentralView ArticlePubMed
- Kent WJ: BLAT–the BLAST-like alignment tool. Genome Res. 2002, 12 (4): 656-664.PubMed CentralView ArticlePubMed
- Baccarelli A, Wright RO, Bollati V, Tarantini L, Litonjua AA, Suh HH, Zanobetti A, Sparrow D, Vokonas PS, Schwartz J: Rapid DNA Methylation Changes after Exposure to Traffic Particles. Am J Respir Crit Care Med. 2009, 179 (7): 572-578. 10.1164/rccm.200807-1097OC.PubMed CentralView ArticlePubMed
- Rusiecki JA, Baccarelli A, Bollati V, Tarantini L, Moore LE, Bonefeld-Jorgensen EC: Global DNA hypomethylation is associated with high serum-persistent organic pollutants in Greenlandic Inuit. Environ Health Perspect. 2008, 116 (11): 1547-1552. 10.1289/ehp.11338.PubMed CentralView ArticlePubMed
- Eckhardt F, Lewin J, Cortese R, Rakyan VK, Attwood J, Burger M, Burton J, Cox TV, Davies R, Down TA, Haefliger C, Horton R, Howe K, Jackson DK, Kunde J, Koenig C, Liddle J, Niblett D, Otto T, Pettett R, Seemann S, Thompson C, West T, Rogers J, Olek A, Berlin K, Beck S: DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet. 2006, 38 (12): 1378-1385. 10.1038/ng1909.PubMed CentralView ArticlePubMed
- Smith ZD, Chan MM, Mikkelsen TS, Gu H, Gnirke A, Regev A, Meissner A: A unique regulatory phase of DNA methylation in the early mammalian embryo. Nature. 2012, 484 (7394): 339-344. 10.1038/nature10960.PubMed CentralView ArticlePubMed
- Brenet F, Moh M, Funk P, Feierstein E, Viale AJ, Socci ND, Scandura JM: DNA methylation of the first exon is tightly linked to transcriptional silencing. PLoS One. 2011, 6 (1): e14524-10.1371/journal.pone.0014524.PubMed CentralView ArticlePubMed
- Laird PW: Principles and challenges of genomewide DNA methylation analysis. Nat Rev Genet. 2010, 11 (3): 191-203.View ArticlePubMed
- Beyan H, Down TA, Ramagopalan SV, Uvebrant K, Nilsson A, Holland ML, Gemma C, Giovannoni G, Boehm BO, Ebers GC, Lernmark A, Cilio CM, Leslie RD, Rakyan VK: Guthrie card methylomics identifies temporally stable epialleles that are present at birth in humans. Genome Res. 2012, 22 (11): 2138-2145. 10.1101/gr.134304.111.PubMed CentralView ArticlePubMed
- Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ: The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 2004, 32 (Database issue): D493-D496.PubMed CentralView ArticlePubMed
- Wang Y, Leung FC: An evaluation of new criteria for CpG islands in the human genome as gene markers. Bioinformatics. 2004, 20 (7): 1170-1177. 10.1093/bioinformatics/bth059.View ArticlePubMed
- Aboyoun P, Pages H, Lawrence M: GenomicRanges: Representation and manipulation of genomic intervals. R package version 1.6.7.
- Yuen RK, Penaherrera MS, von Dadelszen P, McFadden DE, Robinson WP: DNA methylation profiling of human placentas reveals promoter hypomethylation of multiple genes in early-onset preeclampsia. Eur J Hum Genet. 2010, 18 (9): 1006-1012. 10.1038/ejhg.2010.63.PubMed CentralView ArticlePubMed
- Du P, Zhang X, Huang CC, Jafari N, Kibbe WA, Hou L, Lin SM: Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics. 2010, 11: 587-10.1186/1471-2105-11-587.PubMed CentralView ArticlePubMed
- Du P, Kibbe WA, Lin SM: lumi: a pipeline for processing Illumina microarray. Bioinformatics. 2008, 24 (13): 1547-1548. 10.1093/bioinformatics/btn224.View ArticlePubMed
- Maksimovic J, Gordon L, Oshlack A: SWAN: Subset-quantile within array normalization for illumina infinium HumanMethylation450 BeadChips. Genome Biol. 2012, 13 (6): R44-10.1186/gb-2012-13-6-r44.PubMed CentralView ArticlePubMed
- Holger S: siggenes: Multiple testing using SAM and Efron's empirical Bayes approaches. R package version 1.28.0. 2011.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.