To complete the expanded annotation, we calculated additional probe location information based on the Illumina-provided MAPINFO GenomeStudio column (location of the C in the target CpG): 1) the interval of the target CpG (CpG), 2) the interval containing the probe but excluding the target CpG (Probe w/o CpG) and 3) the interval of the entire probe (entire probe) (Additional files 1 and 14). Probe type (type I vs type II) and strand of design (F or R) were taken into consideration when calculating genomic location. Ten type I and ten type II probes were manually checked against the annotated probe sequence. A UCSC track was created containing the targeted Cs on the 450 k array (Additional file 15). All annotation and analysis of the expanded annotation was conducted on 485,512 probes, including both cg (CpG loci) and ch (non-CpG loci) probes but excluding rs (SNP assay) probes, unless otherwise specified.
The dbSNP131 table was imported into Galaxy (http://galaxyproject.org; Galaxy, Pennsylvania State University, PA, USA) from UCSC. Only rs numbers for SNPs that spanned an interval of 1 bp and were of the highest quality (weight = 1) were included in the annotation. An interval file was uploaded into Galaxy using the hg19 location we annotated for the interval of each probe spanning the C and G of the target CpG, for cg probes only. The probe file was intersected with the dbSNP131 table to create a list of probes with documented SNPs in the C or G of the target CpG (target CpG SNP). This file was collapsed in R (http://www.r-project.org; R Foundation for Statistical Computing, Vienna University of Economics and Business, Vienna, Austria) to create a list of rs numbers for each probe, since some target CpGs were documented with more than one SNP. The rs numbers for SNPs in the target CpG were included in the expanded annotation in the ‘target CpG SNP’ column (n = 20,270), while the number of SNPs per probe was annotated in the ‘n_target CpG SNP’ column.
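The collapsing step can be illustrated with a minimal sketch. The collapse described above was done in R; Python is used here only for illustration, and the function name, the probe and rs identifiers, the ‘;’ separator and the underscore-style keys are all assumptions rather than details taken from the expanded annotation.

```python
from collections import defaultdict


def collapse_snps(pairs):
    """Collapse (probe_id, rs_id) rows into one record per probe."""
    rs_by_probe = defaultdict(list)
    for probe_id, rs_id in pairs:
        if rs_id not in rs_by_probe[probe_id]:  # skip duplicate rows
            rs_by_probe[probe_id].append(rs_id)
    return {
        probe_id: {
            "target_CpG_SNP": ";".join(rs_ids),  # separator is an assumption
            "n_target_CpG_SNP": len(rs_ids),
        }
        for probe_id, rs_ids in rs_by_probe.items()
    }


# Hypothetical intersection output: probe_A overlaps two SNPs, probe_B one.
rows = [("probe_A", "rs_1"), ("probe_B", "rs_2"), ("probe_A", "rs_3")]
collapsed = collapse_snps(rows)
```

Each probe ends up with a single row holding its rs numbers and their count, which is the shape of the two annotation columns described above.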
Non-specific probe annotation
To identify probes that potentially have multiple genomic targets (non-specific probes), we followed the method described by Chen et al. Special treatment of type II probes was required as the Illumina annotation has noted Cs in CpGs within the probe as an ‘R’ SNP. For type II probes that contained Rs we considered two probe sequence versions, one with all Rs replaced by As and the other with all Rs replaced by Gs. Using these conditions, we matched each of the 450 k probes with the Illumina-annotated genomic location (intended target).
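As a small illustration of the two-version treatment of type II probes, the following sketch expands a probe sequence containing Rs into its all-A and all-G versions. The function name and the example sequence are hypothetical.

```python
def expand_r_versions(probe_seq):
    """Return the two versions of a probe: all Rs as A, and all Rs as G."""
    return probe_seq.replace("R", "A"), probe_seq.replace("R", "G")


# Hypothetical short sequence with two R positions.
version_a, version_g = expand_r_versions("ACRTTR")
```

A probe without any R yields two identical sequences, so it only needs to be aligned once.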
Briefly, we used BLAT to align probe sequences to four versions of the hg19 draft sequence genome: 1) a fully unmethylated ‘bisulfite treated’ genome, with all Cs converted to Ts; 2) a fully methylated ‘bisulfite treated’ genome, with only non-CpG Cs converted to Ts; 3) and 4) were the above treatments on the reverse complement sequence. BLAT was run using the following parameters: stepSize = 5, wordsize = 11 and repMatch = 1,000,000; lowering the word length led to only fractionally more hits. The selection criterion used was as previously outlined: for a probe to be considered non-specific, there had to be 90% identity over the aligned region, at least 40 of 50 matching bps, no gaps, and the 50th nucleotide had to align, as the probe hybridizes to the target CpG at this position. The number of non-specific probe hits was annotated in the expanded annotation ‘AlleleA_Hits, AlleleB_Hits’ columns, while the site of cross-hybridization was annotated in the columns ‘XY_Hits’ (if at least one hit was on a sex chromosome) and ‘Autosomal_Hits’ (if at least one hit was on an autosomal chromosome). Repetitive sequences from RepeatMasker were marked in lowercase in the four genomes. Thus we identified the amount of repetitive DNA within the Illumina-intended alignment of each probe in the expanded annotation column ‘n_bp_repetitive’.
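The four in silico genomes and the hit selection criteria can be sketched as follows. This is an illustration under stated assumptions, not the pipeline that was run: the function names and the alignment fields (matches, aligned_length, n_gaps, query_end) are hypothetical, ‘90% identity’ is read as at least 90%, and a C at the very end of a sequence segment is treated as non-CpG because its neighbor is not visible. Letter case is preserved so that lowercase RepeatMasker annotation would survive the conversion.

```python
import re

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def convert_unmethylated(seq):
    """Fully unmethylated genome: every C becomes T."""
    return seq.replace("C", "T").replace("c", "t")


def convert_methylated(seq):
    """Fully methylated genome: only Cs not followed by G become T."""
    return re.sub(r"[Cc](?![Gg])",
                  lambda m: "T" if m.group() == "C" else "t", seq)


def four_genomes(seq):
    rc = reverse_complement(seq)
    return {
        "forward_unmethylated": convert_unmethylated(seq),
        "forward_methylated": convert_methylated(seq),
        "reverse_unmethylated": convert_unmethylated(rc),
        "reverse_methylated": convert_methylated(rc),
    }


def is_nonspecific_hit(matches, aligned_length, n_gaps, query_end,
                       probe_length=50):
    """Apply the selection criteria to one alignment (hypothetical fields)."""
    return (aligned_length > 0
            and matches / aligned_length >= 0.90
            and matches >= 40
            and n_gaps == 0
            and query_end == probe_length)  # 50th nucleotide must align


genomes = four_genomes("ACGTCCGA")
```

Here query_end is taken to be the 1-based position of the last aligned probe base, so a value equal to the probe length means the 50th nucleotide is part of the alignment.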
CpG enrichment annotation
Illumina categorized probes in CpG islands (GenomeStudio column ‘Relation_to_UCSC_CpG_Island’) based on the UCSC Genome Browser criteria of CG content >50%, Obs/Exp CpG ratio >0.60 and length >200 bps. Shores and shelves were identified based on their relationship to a CpG island; shores as the 2 kbs up- and down-stream of CpG islands and shelves as the 2 kbs outside of shores. The remaining probes were located in non-island regions, which we refer to as the ‘sea’ (Additional file 5A).
We annotated probes into four HIL CpG classes based on alternative CpG enrichment criteria: high-density CpG island probes (HC, n = 153,859), intermediate-density CpG island probes (IC, n = 118,727), ICshore probes (probes in ICs that border HCs, n = 33,955) and non-island probes (LC, n = 178,971) (Additional file 5B). This annotation has been added in the ‘HIL_CpG_class’ column of the expanded annotation. To locate probes within each of the four CpG classes, we first annotated these CpG enrichment classes throughout the genome. The hg19 genomic sequence was downloaded from UCSC in overlapping segments and read by CpGIE, a Java software program. CpGIE searches input sequences in sliding windows based on user-set criteria. HCs were defined as regions with CG content >55%, Obs/Exp CpG ratio >0.75 and length >500 bps, while ICs were defined as regions with CG content >50%, Obs/Exp CpG ratio >0.48 and length >200 bps [16, 18]. CpGIE HC and IC output was merged into a single file for each chromosome, duplicate islands were removed and CpG islands were identified as follows: ICs, isolated regions of the genome with IC density; ICshores, regions of the genome with IC density that were next to regions with HC density; HCs, any region of the genome with HC density; and LCs, regions that were not of IC or HC density. Islands were given unique names in the annotation, for example, chr8_IC:49890018–49891221 (chr#_CpG class: genomic start–genomic end). The hg19 HC and IC islands have been compiled into a UCSC track available in Additional file 16. The hg19 HIL annotation was intersected with the genomic location (hg19) of 450 k targets in Galaxy to assign probes into the four CpG classes. An annotation of probes into HIL CpG islands using the detailed nomenclature can be found in the expanded annotation column ‘HIL_CpG_Island_Name’.
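To make the HC and IC thresholds concrete, the sketch below scores a single candidate region against them. It is not CpGIE: it does no sliding-window search and does not resolve ICshores, which depend on adjacency to an HC. The Obs/Exp ratio is assumed to follow the commonly used definition (number of CpGs × length) / (number of Cs × number of Gs), and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (CG content, Obs/Exp CpG ratio) for one sequence."""
    seq = seq.upper()
    n = len(seq)
    n_c, n_g = seq.count("C"), seq.count("G")
    n_cpg = seq.count("CG")
    cg_content = (n_c + n_g) / n if n else 0.0
    obs_exp = (n_cpg * n) / (n_c * n_g) if n_c and n_g else 0.0
    return cg_content, obs_exp


def classify_region(seq):
    """Label a region HC, IC or LC using the thresholds quoted above."""
    cg_content, obs_exp = cpg_stats(seq)
    length = len(seq)
    if cg_content > 0.55 and obs_exp > 0.75 and length > 500:
        return "HC"
    if cg_content > 0.50 and obs_exp > 0.48 and length > 200:
        return "IC"
    return "LC"


# Synthetic sequences used only to exercise the thresholds.
labels = [classify_region("CG" * 300),   # 600 bp, CpG-dense
          classify_region("CG" * 150),   # 300 bp, too short for HC
          classify_region("AT" * 300)]   # no C or G at all
```

Because the HC criteria are checked first, a region that satisfies both sets of thresholds is labeled HC, consistent with HCs being any region of HC density.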
Gene feature and TSS annotation
Using the NCBI Reference Sequence (RefSeq) gene annotation, we annotated probes into nine groups based on three gene components (first exons, exons and introns) and three gene regions (5’UTR, body and 3’UTR). Probes were grouped into: 1) 5’UTR first exons, 2) 5’UTR exons, 3) 5’UTR introns, 4) body first exons, 5) body exons, 6) body introns, 7) 3’UTR first exons, 8) 3’UTR exons and 9) 3’UTR introns (Figure 5). Briefly, the hg19 RefSeq table was downloaded from UCSC. Exon and intron information was extracted and parsed into genomic interval data with the most upstream exon denoted as the first exon. Next, 5’UTR, gene body and 3’UTR location was parsed into genomic interval data utilizing the transcription start/stop and coding start/stop information from RefSeq. Intersection was performed between each of 5’UTR, gene body and 3’UTR with first exon, exon and intron intervals to generate the nine gene features. The gene feature intervals were then intersected with the hg19 genomic location of 450 k targets in R to assign probes into the nine gene features. This annotation was completed using both RefSeq gene names and transcript names. Gene feature annotation was conducted using the GenomicRanges package in R.
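The region-by-component intersection can be sketched for a single plus-strand transcript. The annotation itself was built with GenomicRanges in R; this Python version is only an illustration, and it assumes half-open (start, end) coordinates, treats the first exon as a separate class from the remaining exons, and uses made-up coordinates and function names.

```python
def intersect(a, b):
    """Overlap of two half-open intervals, or None if they do not overlap."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def gene_features(regions, components):
    """Intersect each gene region with each gene component."""
    features = {}
    for region_name, region in regions.items():
        for component_name, intervals in components.items():
            hits = [hit for hit in (intersect(region, iv) for iv in intervals)
                    if hit is not None]
            if hits:
                features[f"{region_name} {component_name}"] = hits
    return features


# Hypothetical transcript: transcribed 100-1000, coding 250-800.
regions = {"5'UTR": (100, 250), "body": (250, 800), "3'UTR": (800, 1000)}
components = {
    "first exon": [(100, 300)],
    "exon": [(500, 600), (700, 1000)],
    "intron": [(300, 500), (600, 700)],
}
features = gene_features(regions, components)
```

In this example the first exon is split between the 5’UTR and the body, and the last exon between the body and the 3’UTR, which is why a single exon can contribute to two of the nine features.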
The hg19 UCSC knownGene table was downloaded to Galaxy and the closest TSS for each probe was annotated, regardless of whether the probe was located within the same gene. For each probe, the closest TSS and the distance to it, along with the gene name and transcript name, were noted in the expanded annotation columns ‘Closest_TSS’, ‘Distance_closest_TSS’, ‘Closest_TSS_gene_name’ and ‘Closest_TSS_Transcript’.
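A closest-TSS lookup of this kind can be sketched with a binary search over the sorted TSS positions of one chromosome. The annotation was produced in Galaxy; the function below is a hypothetical stand-in, and the sign convention for the distance (probe position minus TSS position, ignoring strand) and the example gene and transcript names are assumptions.

```python
import bisect


def closest_tss(position, tss_records):
    """Find the nearest TSS to a probe position on one chromosome.

    tss_records: list of (tss_position, gene_name, transcript_name),
    sorted by tss_position. Returns (record, signed_distance) or None.
    """
    if not tss_records:
        return None
    positions = [record[0] for record in tss_records]
    i = bisect.bisect_left(positions, position)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = tss_records[max(i - 1, 0):i + 1]
    best = min(candidates, key=lambda record: abs(record[0] - position))
    return best, position - best[0]


# Hypothetical TSS table for one chromosome.
tss_table = [(100, "GENE_A", "tx_1"), (500, "GENE_B", "tx_2"),
             (900, "GENE_C", "tx_3")]
record, distance = closest_tss(450, tss_table)
```

When a probe is equidistant from two TSSs, this sketch returns the upstream one; how ties were resolved in the actual annotation is not stated.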
Two male and two female chorionic villus samples were collected through the BC Women’s Hospital & Health Centre, Vancouver, BC, Canada, as controls for a study of chromosomal abnormalities in the placenta. DNA was extracted from a small piece of chorionic villi as previously described. For each placental sample (n = 4), DNA from two independent chorionic villi was combined in equal amounts prior to bisulfite conversion to ensure a representative sample of the placenta. DNA was extracted by a standard salt method. Two male and two female blood samples were collected as adult controls for ongoing studies of respiratory disease and epigenetics (n = 4). Peripheral blood mononuclear cell (PBMC) DNA was extracted according to standard procedures. Buccal epithelial samples were collected from two males and two females for a study on maternal care effects on childhood DNAm (n = 4). Buccal samples were collected using Isohelix DNA Buccal Swabs (Cell Projects Ltd, Harrietsham, Kent, UK), and stabilization reagents and DNA were extracted using Isohelix DNA Isolation Kits (Cell Projects Ltd) as per the manufacturer’s protocols.
Illumina 450 k array
Two ug of genomic DNA was purified using the DNeasy Blood & Tissue Kit (Qiagen, Valencia, CA, USA) following the manufacturer’s protocol. Purified DNA quality and concentration were assessed with a NanoDrop ND-1000 (Thermo Scientific, Waltham, MA, USA) prior to bisulfite conversion. One ug of purified genomic DNA was bisulfite converted using the EZ DNA Methylation Kit (Zymo Research, Orange, CA, USA) following the manufacturer’s protocol. Bisulfite DNA quality and concentration were assessed using the NanoDrop and, if required, samples were concentrated to approximately 50 ng/ul using a SpeedVac (Thermo Electron Corporation, Waltham, MA, USA). Following the Illumina 450 k array protocol, 4 ul of bisulfite converted sample was whole-genome amplified, enzymatically digested, hybridized to the array and then single nucleotide extension was performed.
Two assay types are used by the 450 k array to measure DNAm: Infinium I (type I probes) and Infinium II (type II probes), bound to beads scattered throughout the array. When a probe successfully binds to DNA, a single fluorescently labeled nucleotide extends off the probe and this signal is read by an Illumina scanner. The Infinium I assay uses two bead types specific to the CpG of interest: an unmethylated (u) and a methylated (m) bead, each with a different probe design (ProbeA (u) and ProbeB (m)). Both type I probes for a given CpG fluoresce in the same color channel (either red (Cy5) or green (Cy3)). The Infinium II assay uses only one bead type for each CpG of interest, an m + u bead. One probe is designed for each type II target site and the color of fluorescence is based on which nucleotide is incorporated in the single base extension step. The incorporation of an A or T signals an unmethylated site in red (u) and the incorporation of a C or a G signals a methylated site in green (m).
Chips were scanned using an Illumina HiScan on a two-color channel to detect Cy3 labeled probes on the green channel and Cy5 labeled probes on the red channel. Illumina GenomeStudio Software 2011.1 was used to read the array output and conduct background normalization. The signalA, signalB and probe intensity were exported for autosomal probes and read into R. M values were generated using the Bioconductor (http://www.bioconductor.org; Fred Hutchinson Cancer Research Center, Seattle, WA, USA) methylumi package as M = log2((intensity m + 1)/(intensity u + 1)), since this value has been shown to be valid for statistical analyses. Following correction for chip to chip color bias using the Bioconductor lumi package and probe type correction using subset-quantile within array normalization (SWAN), M values were converted to ß values using the equation ß = 2^M/(2^M + 1). The ß value is a number ranging from 0 to 1 that is directly proportional to percentage DNAm; thus to ease interpretation, we have reported results as ß values. The microarray data used in this article was submitted to the NCBI GEO under accession number [GSE:42409]. Probes with a detection p value >0.01 in any sample, probes with no ß value in any sample, all rs and ch probes, all sex chromosome and non-specific probes were removed prior to analyses. The level of DNAm for 428,216 probes in our sample dataset was intersected with the expanded annotation for further analyses.
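The two conversions can be written out as a short sketch, reading the M value as the log2 ratio of (methylated intensity + 1) to (unmethylated intensity + 1) and the ß value as 2^M/(2^M + 1). The function names are hypothetical and the intensities are made-up numbers; the actual values were computed with the methylumi and lumi packages in R.

```python
import math


def m_value(intensity_m, intensity_u):
    """M = log2((intensity_m + 1) / (intensity_u + 1))."""
    return math.log2((intensity_m + 1) / (intensity_u + 1))


def beta_from_m(m):
    """Beta = 2^M / (2^M + 1), which lies strictly between 0 and 1."""
    return 2 ** m / (2 ** m + 1)


# Made-up intensities: methylated signal four times the unmethylated signal.
m = m_value(1023, 255)
beta = beta_from_m(m)
```

Substituting the definition of M shows that, before any normalization of M, this ß equals (intensity m + 1)/(intensity m + intensity u + 2), so an M of 0 corresponds to a ß of 0.5.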
Processing of aging dataset
Series matrix files were downloaded for [GSE:40279] containing ß values for 473,039 probes per sample. We worked with the subset of samples that roughly matched the age of the samples used in our study (n = 261, aged 19 to 61 years). Probes with no ß value in any sample, all sex chromosome probes, all rs and all ch probes were removed from the dataset. For SNP analyses, non-specific probes were also removed; however, these were retained in the analysis of autosomal sex-specific probes. For the discovery of autosomal probes with sex differences in DNAm, ß values were read into R, converted into M values using the Bioconductor package lumi and then significance analysis of microarrays (SAM) was conducted using the Bioconductor package siggenes. At FDR <1%, 10,139 autosomal probes were identified as significantly different between male and female samples. Next, this list was crossed with a list of Δß values for each probe, calculated by taking the absolute value of the difference between the average ß of males and the average ß of females.
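The Δß calculation for a single probe amounts to the following, shown as a minimal sketch with a hypothetical function name and made-up ß values.

```python
from statistics import mean


def delta_beta(male_betas, female_betas):
    """Absolute difference between mean male and mean female beta values."""
    return abs(mean(male_betas) - mean(female_betas))


# Made-up beta values for one probe.
example_delta = delta_beta([0.2, 0.4], [0.6, 0.8])
```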
Probe cg06961873 was selected for genotype validation of SNP rs61775206 in each sample. Primers were designed using PSQ Assay Design software version 1.0.6 (Biotage AB, Uppsala, Sweden). Primer sequences and probe information are available in Additional file 17. Using the following conditions, 0.5 ul of genomic DNA was PCR-amplified: 95°C for 5 minutes, (95°C for 20 seconds, 55°C for 20 seconds, 75°C for 20 seconds) × 50, 72°C for 5 minutes. Genotyping was performed using a PyroMark MD system (Biotage AB) and analyzed with PSQ 96MA SNP software (Biotage AB).
A KS test was used to assess the difference in distribution of SD in ß values for probes that contained SNPs. The KS statistic represents the maximum absolute difference between the empirical cumulative distribution functions of two samples. Probes with small within-tissue SD in ß (<0.10) were removed from all probe groups to increase the power of the analysis. Probes with a target CpG SNP were removed from the SNP <10 bp group. The number of probes included in the SD in ß distribution curves for blood samples was 5,450 for all probes, 809 for SNP >10 bp, 402 for SNP <10 bp and 2,190 for target CpG SNP, and for the aging dataset was 6,267 for all probes, 1,022 for SNP >10 bp, 362 for SNP <10 bp and 2,753 for target CpG SNP. KS tests were also used to assess the difference in distribution of DNAm between Illumina CpG classes and between HIL CpG classes. Fisher’s exact test was used to compare the distribution of the number of probes within the three levels of DNAm for both Illumina and HIL-annotated CpG classes: hypomethylated (ß values of 0 to ≤0.2), heterogeneously methylated (ß values of >0.2 to <0.8) and hypermethylated (ß values of ≥0.8 to 1.0). Enrichment analyses of tDM probes were performed in Python (Python Software Foundation). To select tDM probes, DNAm was first averaged for each probe within a tissue. A z-score was calculated for each probe comparison between tissues. A p value cutoff of 0.05 was selected with a Bonferroni correction to account for repeated comparisons. KS and Fisher’s exact tests were performed in R. Statistical significance was defined as a p value <1.0 × 10^-7. All figures were created in R and Adobe Illustrator CS6.
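Two pieces of this analysis lend themselves to a short illustration: the two-sample KS statistic and the binning of ß values into the three DNAm levels. The KS tests themselves were run in R; the pure-Python function below only computes the statistic (no p value), and the function names and example values are hypothetical.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the maximum is always attained at an observed value
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def methylation_level(beta):
    """Assign a beta value to one of the three DNAm levels defined above."""
    if beta <= 0.2:
        return "hypomethylated"
    if beta < 0.8:
        return "heterogeneously methylated"
    return "hypermethylated"


# Made-up SD values for two probe groups.
d_example = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

Counting probes per level within each CpG class gives the contingency table that a Fisher’s exact test would then compare.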