Monocytes isolation
Primary human monocytes were isolated from healthy normolipidemic volunteers (Hm02, Hm06 and Hm10) by leukapheresis and counterflow elutriation as described previously [32].
DNA extraction from monocytes and tissues
DNA was isolated from monocytes using QIAamp columns (Qiagen, Germany) and quantified with a Nanodrop 100 spectrophotometer (Peqlab, Germany).
Whole genome bisulfite sequencing and analysis
Generation of whole genome bisulfite sequencing data from monocytes obtained from donor Hm02 was performed as described previously [10, 12].
Detecting DMRs
We used the WGBS datasets from Hm02 and four additional donors (Hm01, Hm03, Hm05 and M55900; see Data retrieval and deposition) to generate two synthetic methylomes, one with the highest methylation level of each CpG in the five samples and one with the lowest methylation level. We modified BSmooth [13] to identify differentially methylated regions with a minimum difference of 0.8 between the two synthetic methylomes. BSmooth is designed to compare a group of multiple cases against a group of multiple controls. Because we have no class labels, our data consist of two single synthetic methylomes (min, max) and therefore of case and control groups of one sample each. The main formula of the BSmooth algorithm
$$t\left( c \right) = \frac{\Delta \left( c \right)}{{\left[ {\sigma \left( c \right)\sqrt {\frac{1}{{n_{2} }} + \frac{1}{{n_{2} }}} } \right]}}$$
calculates a signal-to-noise statistic t(c) for each CpG c with Δ(c) referring to the mean methylation differences of both groups.
In our case, we reduced this formula to t(c) := (max(c) − min(c)) for each CpG c. The terms max(c) and min(c) simulate the process of creating synthetic methylomes by selecting the maximum and minimum methylation level of a CpG over all samples. DMRs are formed by consecutive groups of CpGs with t(c) > v or t(c) < −v with a threshold v > 0. We use v = 0.5 as parameter for the DMR calling.
A DMR’s border may differ in shape, and DMR calling algorithms often cannot identify them exactly. In contrast to BSmooth, we calculate a DMR’s methylation level by building a weighted average methylation level
$$\mu_{i} \left( d \right) = \frac{1}{{\left| {C\left( d \right)} \right|}}\mathop \sum \limits_{c \in C\left( d \right)} \sigma \left( c \right)m_{i} \left( c \right)$$
for DMR d, its set of CpGs C(d), the methylation levels m
i
(c) and standard deviation over all samples σ(c) of CpG c ∈ C(d) in sample i. We call μ
i
(d) the core methylation of d in sample i. The core methylation is less influenced by the inaccurate DMRs borders. We only keep DMRs with high (core) mean methylation differences ≥0.8 and sufficiently long DMRs consisting of 4 CpGs or more.
To test the DMRs for statistical significance, we calculated an empirical p value by simulating 1000 sets of five samples according to the null model that there is no methylation difference as follows:
Let n
s,c
be the coverage and m
s,c
the methylation count for observed sample s at CpG c. First, we calculate the average methylation
$$p_{c} = \frac{{\sum m_{s,c} }}{{\sum n_{s,c} }}$$
value for each CpG c.
For each of the five observed samples s, we simulate a corresponding null sample o.
We set the coverage of CpG c in sample o to n
o,c
= n
s,c
.
The methylation count M
o,c
for each c is randomly chosen with binomial probability
$$P\left( {M_{o,c} = m} \right) = \left( {\begin{array}{*{20}c} {n_{o,c} } \\ m \\ \end{array} } \right) \cdot p_{c}^{m} \cdot \left( {1 - p_{c} } \right)^{{n_{o,c} - m}}$$
Therefore, the coverage of each CpG in the null sample is equal to the coverage in the corresponding observed sample, while differences in methylation are only caused by finite sampling size. Per definition, there exist no DMRs for the null samples, and every detected DMR is a false positive. We applied our algorithm for DMR detection to the null samples. We repeated the process 1000 times.
The algorithm did not detect any DMRs. This leads to an empirical p value <0.001 for each DMR.
Calculation of the DMR detection rate
We assume that a single SNP is responsible for the methylation of a DMR and that the probability of being a causative SNP is independent of its allele frequency. We further assume that the epigenotypes follow the Hardy–Weinberg equilibrium with P(AA) = p
2, P(aa) = q
2 and P(Aa or aA) = 2pq with some frequencies p and q such that p + q = 1. Our approach is only able to detect DMRs where at least one out of n samples is fully methylated and at least one sample is unmethylated.
The probability of obtaining such an event can be derived from an urn model with three different types of balls. Two types with probabilities p
2 and q
2 = (1 − p)2, respectively, represent the two different homozygous SNP states, and the third type with probability 2pq represents the heterozygous SNP state. For n samples, the probability to draw at least one ball of each of the first two types is P(p, q) = 1 − [(1 − p
2)n + (1 − q
2)n − (2pq)n], where we have applied the inclusion–exclusion principle to the complementary event.
We used the known allele frequencies of all SNPs with a minor allele frequency >0.05 contained in dbSNP [33] to estimate the fraction of detectable DMRs. For n = 5, we estimate that we can detect 23% of DMRs with a minor epigenetic allele frequency >0.05.
SNP genotyping
For donors Hm01, Hm02, Hm03, Hm05 and M55900, 2.5 million SNPs were genotyped using Illumina’s Omni2.5Exome Bead Array. For donors Hm06 and Hm10, SNP genotypes were inferred from the targeted bisulfite sequencing data (see below), or by Sanger sequencing regions amplified with primers listed in Additional file 22.
DMR SNP correlation score calculation
The mean methylation level of a sample in a region with allele-specific methylation is expected to be either close to 0.0, close to 1.0 or about 0.5. Due to inaccurate DMR borders, finite sequencing coverage and noise, measured values may differ from this expectation. We assume three possible classes “full-methylated,” “half-methylated” and “unmethylated” for this epigenotype.
In order to compare these epigenotypes with SNP genotypes, we have to classify the methylation level of each sample for each DMR. To avoid fixed thresholds for class assignment, we calculate the posterior probabilities of mean DMR methylation level to fall into each of the classes. We consider the empirical distribution (histogram) of 157 × 5 = 768 core methylation levels μ
i
(d) of each sample i and DMR d, which contains data from all three classes. This empirical distribution can be decomposed into a three-component mixture of beta distributions. A beta distribution is a continuous probability distribution on the unit interval [0, 1] that is frequently used to model data that naturally takes values between 0 and 1 [34] such as methylation levels. Each component beta distributions have two parameters α and β that determine the shape of the beta distribution. We used the betamix software [35] to robustly fit a three-component beta mixture model to the observed histogram.
Let α
i
and β
i
be the beta distribution parameters and π
k
the mixture coefficient of component k € {0, 1, 2} after fitting. For a single sample, a DMR with a core methylation level μ and one SNP genotype g € {0, 1, 2}, the posterior probability of g given μ is given by
$$L\left( {g,{{\mu }}} \right) = \frac{{\pi_{g} b_{{\alpha_{g} ,\beta_{g} }} \left( {{\mu }} \right)}}{{\mathop \sum \nolimits_{k} \pi_{k} b_{{\alpha_{k} ,\beta_{k} }} \left( {{\mu }} \right)}}.$$
To calculate a score based on multiple samples, we extend the formula. Let g
i
(s) € {0, 1, 2} be the genotype and μ
i
(d) be the core methylation level (see “Detecting DMRs” section) for DMR d, sample i and SNP s. The joint posterior probability
$${\text{score}}\left( {s,d} \right) = \mathop \prod \limits_{i} L\left( {g_{i} \left( s \right),\mu_{i} \left( d \right)} \right)$$
is given by the product of the single posterior probabilities over all samples. We use this posterior probability as a score to assess whether DMR d and SNP s are co-varying. The scores are separately calculated for each DMR and each SNP within a range of ±6 kb of the DMR’s location. For n = 5 samples, we used 0.9 as a threshold to call an SNP correlated with a DMR.
GWAS analysis
The SNP array data were produced with three different SNP array types: Omni1_Quad_v1 (334 probands), OmniExpress_12v1.0 (627 probands) and OmniExpress_12v1.1 (170 probands). The data were normalized and CpG methylation levels extracted using RnBeads v1.2.2 [36].
We filtered each array separately by removing SNPs that failed the Hardy–Weinberg test at a significance threshold of 0.001, having a minor allele frequency less than 0.01 or a missing rate greater than 0.1 using plink v1.07 [37]. The arrays were merged by plink and the data again filtered by plink using the previously described parameters. This merged data served as genotypes for the GWASs.
The Spearman correlations and p value calculation between methylation levels and SNP genotypes were performed using NumPy v1.11.0 and SciPy v0.14.0 [38].
For the imputation, a region was chosen that includes all SNPs with a p value <5 × 10−8 but to a maximum of ±1 Mb of the CpGs position. The arrays were then converted to ped-format using gtool v0.7.5 [39] and separately imputed using impute2 [40] for the determined regions with the phase 3 data of the 1000 genomes project [41]. The imputed data were reconverted to bed-format again using gtool and merged under the previously given filter parameters by plink.
DMR validation by targeted deep bisulfite sequencing
Bisulfite-converted DNA was obtained using 500 ng of monocytes DNA (Hm01, Hm02, Hm03, Hm05, Hm06 and Hm10) and the EZ DNA Methylation-Gold Kit (Zymo Research) according to the manufacturer’s instructions. Locus-specific bisulfite amplicon libraries were amplified by PCR employing bisulfite tagged primers (Additional file 22) designed using the MethPrimer [42] and BiSearch [43, 44] tools and HotStarTaq Master Mix (Qiagen). Sample-specific barcode sequences (MID, multiplex identifiers) and universal linker tags (454 adaptor sequences) were added by performing a second PCR. Samples were prepared and sequenced on a Roche/454 GS Junior system (Roche Diagnostics) with special filter settings applied to increase the yield of reads [45]. Automated CpG methylation analysis was performed using the Amplikyzer software [46] with minimum bisulfite conversion rate set to 95%, leading to an average of 2450 reads per sample (minimum 187).
Histone modification ChIPseq heatmaps
Heatmaps visualizing the ChIP log2-ratio between signal and input across six histone modifications in two biological replicates were generated using deepTools [47] as previously described [10], except that we plotted data from 2-kb regions centered on the middle of 157 DMRs and clustered them using k = 5 clusters in the k-means algorithm. Independent clustering was performed for Hm03 and Hm05, since the two donors differ in the DMRs DNA methylation values.
Chromatin segmentation by chromatin states
Chromatin segmentation of samples Hm03 and Hm05 was performed with ChromHMM [16]. We estimated the p value for over- and underrepresented chromatin states by simulating 1 million datasets, consisting of 151 regions each and equal size distribution compared to our DMRs. Each of the simulated regions was selected from non-repetitive regions covering at least 4 CpGs. For each of our 157 DMRs and each region of the simulated datasets, the overlap between its coordinates and the chromatin states of each sample (Hm03, Hm05) was calculated. We then compared the count of overlapping chromatin states of a sample for our DMRs and each random set. The empirical p value for overrepresentation for state x is the fraction of random sets that have a higher count for x than the DMRs. The empirical p value for underrepresentation for state x is the fraction of sets that have a lower count for x than the DMRs. We partitioned the absolute methylation differences between Hm03 and Hm05 into (1) a set for the DMRs with different chromatin states in the two donors and (2) another set for the DMRs with the same states. We consider a state as different in the two donors, if the intersection of the DMR overlapping chromatin states is empty. The two sets of methylation differences were then compared by applying the Wilcoxon rank-sum test to determine if the methylation differences are independent from a chromatin state difference.
GREAT analysis
The bioinformatic tool GREAT was used to predict DMR functions by analyzing the annotations of nearby genes [48], under species assembly GRCh37 with whole genome background and choosing the “Basal plus extension” association rule setting with default parameters of 5.0 kb upstream, 1.0 kb downstream and up to 1000.0 kb distal.
Distribution of gene expression levels
To obtain the gene expression rates for each gene, we summed the transcript per million (tpm) values as calculated by kallisto with default parameters [49]. Since GREAT uses only the extremely high-confidence genes prediction subset of the UCSC Known Genes, we reduced the kallisto gene list to this subset (n = 17,784). We then partitioned the mean expression rates of Hm03 and Hm05 for each gene into two groups: genes that are associated with a DMR as identified by GREAT (n = 240) and genes that are not associated (17,544). We compared the expression rates of these two groups by applying a Wilcoxon rank-sum test to test for differences.
Identification of transcription factor-binding motifs
We used the TRANSFAC database (professional version, release 2015.3, [50]) to determine, if certain motifs were enriched in the SNP regions (501 regions: SNP ± 100 bp). Example regions as provided by TRANSFAC served as background, and the parameters were set to default.
Differential transcript expression
We ran Tophat 2.0.11 [51], with Bowtie 2.2.1 [52] and NCBI build 37.1 using the following parameters: –library-type fr-firststrand and –b2-very-sensitive setting, to generate the mapping files from total-RNA of Hm03 and Hm05 samples. Subsequently, StringTie [53] with NCBI build 37.1 was run in -e -b -G mode to generate files for analysis with Ballgown [54]. Differential transcript expression analysis was performed using Ballgown, and an FDR cutoff of 0.05 was chosen to extract the differentially expressed transcripts.
Long non-coding RNAs
LNCipedia 4.0 [55, 56] (GRCh37/hg19) was used to extract high-confidence lncRNA regions. Bedtools was subsequently run in “intersect” mode to get the overlap of the differentially methylated regions with the lncRNA regions.
Data retrieval
The full epigenome data from Hm03 and Hm05 monocytes (Study Accession ID: EGAS00001001595, Dataset Accession ID: EGAD00001002201) as well as the methylome data from M55900 (ENA PRJEB5800) and Hm01 (EGAS00001000719) have previously been produced by our group [10, 12]. The BLUEPRINT and CEEHRC WGBS datasets on human monocytes from other male donors were retrieved from the IHEC Data Portal (http://epigenomesportal.ca/ihec/grid.html; [57]). In addition, 450k array data of monocytes and whole blood DNA obtained from six individuals were downloaded from the gene expression omnibus (GSE35069) [20].