Comparison of Methylation Capture Sequencing and Innium EPIC Methylation Array in Peripheral Blood Mononuclear Cells

Background: Epigenome-wide association studies (EWAS) have been widely applied to identify methylation CpG sites associated with human disease. To date, the Innium Methylation EPIC array (EPIC) is commonly used for high-throughput DNA methylation proling. However, the EPIC array covers only 30% of the human methylome. Methylation Capture bisulte sequencing (MC-seq) captures target regions of methylome and has advantages of extensive coverage in the methylome at an affordable price. Methods: Epigenome-wide DNA methylation in four peripheral blood mononuclear cell samples was proled by using SureSelectXT Methyl-Seq for MC-seq and EPIC platforms separately. CpG site-based reproducibility of MC-seq was assessed with DNA sample inputs ranging in quantity of high (> 1000ng), medium (300-1000ng), and low (150ng-300ng). To compare the performance of MC-seq and the EPIC arrays, we conducted a Pearson correlation and methylation value difference at each CpG site that was detected by both MC-seq and EPIC. We compared the percentage and counts in each CpG island and gene annotation between MC-seq and the EPIC array. Results: After quality control, an average of 3,708,550 CpG sites per sample was detected by MC-seq with DNA quantity >1000ng. Reproducibility of MC-seq detected CpG sites was high with strong correlation estimates for CpG methylation among samples with high, medium, and low DNA inputs (r > 0.96). The EPIC array captured an average of 846,464 CpG sites per sample. Compared with the EPIC array, MC-seq detected more CpGs in coding regions and CpG islands. Among the 472,540 CpG sites captured by both platforms, methylation of a majority of CpG sites was highly correlated in the same sample (r: 0.98~0.99). However, methylation for a small proportion of CpGs (N=235) differed signicantly between the two platforms, with differences in beta values of greater than 0.5. Conclusions: Our results show that MC-seq is an ecient and reliable platform for methylome proling a broader of the than the array-based methylation a show large discrepancy between the two platforms, which warrants further investigation and needs cautious interpretation.


Introduction
The rapid increase in the number of epigenome-wide association studies (EWAS) have successfully identi ed differentially methylated CpG sites that are associated with environmental exposures and diseases (1)(2)(3)(4)(5)(6). Such DNA methylation marks have been used as biomarkers for diagnosing, subtyping, and monitoring disease progression (7)(8)(9)(10)(11). One of the most popular and affordable methods to pro le epigenome-wide DNA methylation are array-based platforms, primarily the Illumina Human Methylation 450K (450K) and In nium MethylationEPIC (EPIC) BeadChips (Illumina Inc, San Diego, CA). These arrays both utilize Illumina's beadchip technology that does not require polymerase chain reaction (PCR), but is subject to dye intensity biases between the two platforms (12). These arrays have limited coverage of the methylome and can only detect up to 870,000 CpGs across the epigenome, leaving a large proportion of CpG sites unmeasured. Moreover, the EPIC array offers improved but still suboptimal coverage of regulatory elements (13). Whole genome bisul te sequencing (WGBS) is able to capture more than 28 million CpGs, but the feasibility remains low for the population-based EWAS due to high cost and large genomic DNA input requirements to compensate for degradation during DNA bisul te treatment. Alternatively, Methylation Capture Sequencing (MC-seq) is able to detect DNA methylation at singlenucleotide resolution utilizing a targeted next-generation sequencing approach (14). It permits pro ling of signi cantly more CpG sites than the EPIC array, requires less genomic DNA input than WGBS, and less expensive than WGBS, but can be susceptible to bias due to the presence of PCR duplicates. Feature-tocost comparisons among different platforms can help understand the utilities of each platform and provide guidance for investigators in choosing a methylation pro ling platform.
A few studies have compared the CpG coverage, reproducibility, and performance of array-based and MCseq platforms (15)(16)(17). Teh et al compared MC-seq and the 450K array in seven DNA samples extracted from saliva (15). A recent study compared the EPIC array and TruSeq targeted bisul te sequencing in four cord blood DNA samples (17). However, no comparisons of MC-seq and array-based methylome pro ling of peripheral blood mononuclear cell (PBMC) has been reported. Here, we pro led the DNA methylome in PBMC using the Agilent SureSelect Methyl-Seq platform and compared the results to the EPIC array in DNA samples extracted from PBMCs.

Methods
Methylation Capture Sequencing (MC-seq) DNA Samples Description. DNA was extracted from de-identi ed PBMC collected from four individuals. Genomic DNA quality was determined by estimating the A260/A280 and A260/A230 ratios by spectrophotometry and concentration by uorometry. DNA integrity and fragment size were con rmed using a micro uidic chip run on an Agilent Bioanalyzer. To assess the reproducibility of MC-seq by DNA quantity, DNA samples from each participant were pro led in triplicate times with high (>1000ng), medium (300-1000ng) and low (150-300ng) DNA input. In total, 12 DNA samples were measured by MCseq. Bisulfate conversion was conducted for each DNA sample as described below.
Methyl-Seq Target Enrichment Library Prep. Indexed paired-end whole genome sequencing libraries were prepared using the SureSelect XT Methyl-Seq kit (Agilent, part#G9651B). Genomic DNA was sheared to a fragment length of 150-200 bp using focused acoustic energy delivered by the Covaris E220 system (Covaris, part#500003). Fragmented sample size distribution was determined using the Caliper LabChip GX system (PerkinElmer, Part#122000). Fragmented DNA ends were repaired with T4 DNA Polymerase and Polynucleotide Kinase and "A" base was added using Klenow fragment in a single reaction followed by AMPure XP bead-based puri cation (Beckman Coulter, part#A63882). The methylated adapters were ligated using T4 DNA ligase followed by AMPure XP bead puri cation. Quality and quantity of adapterligated DNA were assessed using the Caliper LabChip GX system. Samples yielding >350 ng were enriched for targeted methylation sites by using the custom SureSelect Methyl-Seq Capture Library.
Hybridization was performed at 65°C for 16 hours using a C1000 Thermal Cycler (BIO-RAD, part# 1851197). Once the enrichment was completed the samples were mixed with streptavidin-coated beads (Thermo Fisher Scienti c, part#65602) and washed with a series of buffers to remove non-speci c bound DNA fragments. DNA fragments were eluted from beads with 0.1 M NaOH. Unmethylated C residues of enriched DNA were modi ed by bisul te conversion using the EZ DNA Methylation-Gold Kit (Zymo Research, part#D5005). The SureSelect enriched, bisul te-converted libraries were PCR ampli ed using custom-made indexed primers (IDT, Coralville, Iowa). Dual-indexed libraries were quanti ed by quantitative polymerase chain reaction (qPCR) using the Library Quanti cation Kit (KAPA Biosystems, Part#KK4854) and inserts size distribution was assessed using the Caliper LabChip GX system. Samples with a yield of ≥2 ng/ul were proceeded to sequencing.
Flow Cell Preparation and Sequencing. Sample concentrations were normalized to 10 nM and loaded onto an Illumina NovaSeq ow cell at a concentration that yields 40 million passing lter clusters per sample. Samples were sequenced using 100bp paired-end sequencing on an Illumina HiSeq NovaSeq according to Illumina standard protocol. The 10bp dual index was read during additional sequencing reads that automatically follows the completion of the rst read. Data generated during sequencing runs were simultaneously transferred to the Yale Center for Genome Analysis high-performance computing cluster. A positive control (prepared bacteriophage Phi X library) provided by Illumina was spiked into every lane at a concentration of 0.3% to monitor sequencing quality in real time.
Preprocessing and Quality Control. Signal intensities were converted to individual base calls during a run using the system's Real Time Analysis (RTA) software. Sample de-multiplexing and alignment to the human genome was performed using Illumina's CASAVA 1.8.2 software suite. The sample error rate was required to be less than 1% and the distribution of reads per sample in a lane was required to be within reasonable tolerance.
Quality control (QC) on MC-seq was conducted following standard procedure as previously described (18). Quality of sequence data was examined by using FastQC (ver. 0.11.8). Adapter sequences and fragments at 5' and 3' (phred score <30) with poor quality were removed by Trim_galore (ver. 0.6.3_dev). We used Bismark pipelines (ver. v0.22.1_dev) to align the reads to the bisul te human genome (hg19) with default parameters (19). Quality-trimmed paired-end reads were transformed into a bisul te converted forward strand version (C->T conversion) or into a bisul te treated reverse strand (G->A conversion of the forward strand). Duplicated reads were removed from the Bismark mapping output and CpG, CHG and CHH (where H = A, T or C) were extracted.
All CpG sites were grouped by sequencing coverage, also known as read depth. The groups with coverage of 1x to 100x were used to test the relationship between coverage and number of CpG sites. Only the CpG sites with coverage > 10x depth were used for nal comparisons to ensure MC-seq data quality. Genes were annotated using Homer annotatePeaks.pl, including intergenic, 5'UTR, promoter, exon, intron, 3'UTR, transcription start site (TTS), and non-coding categories. CpG island, shore, shelf, and open sea annotation was de ned by locally developed bash and R scripts based on genomic coordinates (hg19) of CpG islands from the UCSC genome browser. De nition of CpG shores was de ned as up to 2 kb from CpG islands and CpG shelf was de ned as up to 2 kb from a CpG shore.
Assessment of Reproducibility. We assessed CpG-and participant-based reproducibility for MC-seq among 12 samples with DNA quantity of high, medium, and low input in two ways. First, CpG-based reproducibility was assessed by calculating Pearson correlations using the CpG sites in common from the samples from the same participant with different input DNA quantities. Scatterplots were rendered showing 10,000 randomly selected common CpG sites comparing samples with high and medium, high and low and medium and low DNA input. Second, participant-based reproducibility was assessed by comparing methylation pro les among pairs of participants using the samples with high DNA input, by calculating Pearson correlations of common CpG sites.

EPIC Array Data Preprocessing
The In nium Methylation EPIC array (Illumina, San Diego, CA, USA) was used to measure PBMC DNA methylation pro les from the same four participants. These four samples with DNA input of 1000 ng were preprocessed using standard procedures as previously described (20). Brie y, the predicted sex based on methylome was consistent with self-reported sex for all samples. All samples had a call rate greater than 0.15. A total of 19,090 CpG sites on X chromosomes and 537 CpG sites on Y chromosomes were ltered. Probes within 10 base pairs of a single nucleotide polymorphism were removed. A total of 846,464 CpG sites passed quality control.

Comparison of Methylation at Each CpG site Between MC-seq and EPIC Array
The overall distribution of gene annotation in terms in relation to CpG island and genetic region between MC-seq and EPIC array data among samples from the four participants were compared. CpG sites common between MC-seq and EPIC array assays were de ned according to genomic coordinates. Pearson correlation and the absolute beta-difference value (Db) were calculated among common CpG sites between MC-seq methylation percentage values and EPIC methylation beta values by using R (ver. 3.5.1). If median Db of the common CpG site was > 0.1, it was de ned as a discordant CpG pair between MC-seq and EPIC; otherwise, the CpG site was de ned as a concordant CpG pair. The density plot of Db and a Manhattan plot showing the distribution of Db across epigenome were illustrated. Scatterplots were rendered showing the correlation of b values from 10,000 randomly selected CpG sites measured by both MC-seq and EPIC array.

MC-seq Overview and Reproducibility
In MC-seq, all sequences were e ciently mapped to the reference genome with greater than 89% mapping e ciency. Interestingly, the number of non-CpG sites was signi cantly greater than the number of CpG sites. Among all detected methylation sites by MC-seq, 11% were CpG sites, 65% were CHH sites, and 24% were CHG sites (Figure 1a).  (15,17). Thus, the number of CpG sites with coverage ≥ 10x from MC-seq was used in subsequent analyses.
After quality control ltering, MC-seq captured an average of 2,878,207 methylation CpG sites with coverage ≥10x among the 12 DNA samples, with an average of 3,708,550 CpG sites among samples with high DNA input (>1000ng), an average of 3,046,172 CpG sites among samples with medium DNA input (300-1000ng), and an average of 1,879,898 CpG sites among samples with low DNA input (150-300ng) (Figure 1c and Table 1). Despite the fact that the detected number of CpG sites varied depending on DNA input quantity, CpG-based correlation among the common CpG sites between samples with high and medium, high and low DNA input quantity exceeded r>0.95. Correlations of common CpG sites between medium and low DNA inputs was also high with r in 0.92-0.94 (Table 2). Figure 1d shows the scatterplot of 10,000 randomly selected common CpGs between samples with high and medium, high and low, and medium and low DNA input quantity. Pair-wise participant-based correlations were high as r>0.98 among common CpG sites (Table 3). Overall, MC-seq exhibited good reproducibility. The methylation pro le generating in high DNA input from each participant was used for subsequent analyses.    times more on other regions compared with the EPIC array. The proportion of CpG islands detected by MC-seq was greater than by the EPIC array (42% versus 29%) while the EPIC array detected a modestly higher percentage of CpG sites located in open seas than the MC-seq (39% versus 31%) (Figure 2b).

Comparison of Common CpG sites Measured by MC-seq and EPIC
A total of 472,540 CpG sites were measured by both platforms. Overall, the correlations of these shared CpG sites was high, ranging from r=0.983 to 0.985 across the four samples (Figure 3a). Figure 3b presents the distribution of the absolute difference of methylation b values between MC-seq and EPIC. A small proportion of CpG sites (1.4%) were discordant (i.e., Db > 0.1), while 98.6% of CpG sites were concordant (i.e., Db < 0.1). Figure 3a presents the concordant (blue) and discordant CpG sites (green) between MC-seq and EPIC for participant S1 (Figure 3a). The 60,753 discordant CpG sites appeared to be randomly distributed across the epigenome ( Figure S1). Among the discordant CpG sites, we identi ed 239 CpG sites with highly discrepant methylation (i.e., Db>0.5) (Table S1). Participants S2, S3, S4 has similar distribution of concordant and discordant plots as participant S1 and in shown in Figure S2.
Density plots of methylation b showed bimodal distribution using both the MC-seq and the EPIC array platforms (Figure 3c). Density of methylated CpG sites was slightly higher than the density of unmethylated CpG sites on both platforms. However, the two peaks in the EPIC array density plot were closer than the two peaks in the MC-seq density plot (Figure 3c), indicating that MC-seq captures a higher dynamic range (i.e., more methylated and unmethylated) of CpG sites than the EPIC array. Participants S2, S3, S4 has similar density plots and in shown in Figure S3.

Discussion
We pro led the same PBMC samples using the MC-seq and EPIC array platforms and compared their performance. Our results show that the Agilent SureSelect Methyl-Seq targeted enrichment platform produced high quality DNA methylation sequencing data at single base-pair resolution. MC-seq can reliably detect CpG sites with DNA input quantities as low as 300ng. Overall, MC-seq detected 3-4 times more CpG sites than the EPIC array; however, the proportion of CpG sites mapped on functional genomic regions was similar between the two platforms. Methylation at a majority of CpG sites between the two platforms was highly correlated while methylation at a low percentage of CpG sites differed signi cantly between the two platforms. Speci cally, we found that methylation at 239 CpG sites differed signi cantly between the two platforms with absolute Db values greater than 0.5, which suggests that these CpG sites should be interpreted with caution in EWAS studies.
Our results show that MC-seq produces highly reliable CpG site methylation estimates across the genome. The observed CpG-based reproducibility is high, suggesting that technical variation on CpG calls is low. Inter-personal methylation variation is important for EWAS analysis. We found that our participantbased methylation on common CpG sites across four participants are also highly correlated, which further demonstrates the high reproducibility of this platform.
One disadvantage of sequencing-based approaches is the requirement for a larger quantity of input DNA than array-based approaches for methylation pro ling. The recommended input DNA for Agilent SureSelect platform is 1ug while input DNA quantity for EPIC array can be as low as 250ng. Input DNA quantity is one important consideration in uencing study design and methylation assay platform selection for population-based EWAS. Agilent has reported that DNA quantity can be as low as 250ng for SureSelect sequencing (14). To examine whether DNA quantity impacts the performance of MC-seq and to test whether low input DNA quantity also produces reliable CpG detection, we compared the capacity of CpG site detection across three different DNA input quantities. We found that medium DNA input quantity (i.e., 300ng to 1000ng) reliably detected CpG sites is comparable to the number of CpG sites captured by high DNA input quantity (i.e., greater than 1000ng). Low DNA input quantity (i.e., less than 300ng) detected the lowest number of CpG sites compared with high and medium DNA input quantity. For samples with low DNA input quantity, additional PCR cycles are needed to ensure post-capture library yield that results in extensive duplicate reads. In the four low DNA input samples, the duplicate rate exceeds 80%. Thus, removing duplicate reads is an important step in the QC process for MC-seq. We found that the number of CpG sites in low DNA input samples without duplicated reads still is signi cantly higher than the number of CpG sites detected by the EPIC array.
Consistent with previous reports, we found that methylation at the majority of CpG sites measured by both approaches (>98%) is highly consistent between MC-seq and array-based methods. However, we identi ed 1.4% of CpG sites with discrepancies in CpG methylation that exceeds 10%. More importantly, 239 out of 60,753 discordant CpG sites had methylation differences exceeding 50%. These CpG sites are located on 159 gene regions ( Table 4). Some of these genes have been previously reported to be associated with diseases. For example, SLC45A4 was reported to harbor an epigenetic marker for adiposity (21). The methylation b differs on the CpG site of this gene by as much as 0.63 between the two platforms. We have also identi ed those CpG sites that showed less but still apparent discrepancy between the two assay platforms (i.e., absolute difference of beta values between 0.1-0.5). The top 100 CpG sites discrepant in a range of 0.1-0.4 between two platforms are presented in Table S2 to allow investigators to consider this potential source of bias in EWAS ndings. The discrepancy might be due to bias in the performance of the beadchip assay at these positions, sequence context-dependent impacts on the performance of sequencing, batch effects, or a combination of these possibilities. This large discrepancy warrants further investigation and interpretation of ndings at these CpG sites must be interpreted with caution.
One of the limitations of this study is the small number of participants used to estimate inter-sample variability. A previous study used a benchmark approach to evaluate performance of different platforms (17) and concluded that the EPIC array performed better than the MC-seq platform. However, the study did not remove duplicate reads as part of their data processing, which may have comprised the QC for MC-seq data processing as discussed above. Future studies including benchmarking using a larger sample size could further improve the analysis of platform performance. Of note, MC-seq detected high percentages of CHG and CHH sites across four methylome, which is consistent with previous reports (15). The signi cances of those methylation sites warrant further investigation.
New approaches to measurement of DNA methylation continue to emerge that may warrant similar investigation in an ongoing effort to provide users with empiric comparisons to inform decisions about platform selection. One recent approach is enzymatic methyl-sequencing (EM-seq) (e.g., NEBNext EM-seq by New England Biolabs, Ipswich, MA) (22). The input genomic DNA, requirement is low 10-200 ng and EM-seq has comparable performance to WGBS (22), but its performance in relation to array-or capture sequencing-based approaches has not been reported. Should EM-seq gain popularity, it would be important to directly compare the performance of MC-seq and EM-seq to provide empiric evidence to users to inform platform selection.
Nevertheless, we have demonstrated that MC-seq is an e cient, reliable, and affordable platform that allows medium input quantity of DNA input (i.e., >300ng), which is equivalent to DNA input required for EPIC array. MC-seq has the advantage of capturing signi cantly more CpG sites than the EPIC array. Although methylation measurements between the two platforms are highly consistent, we have identi ed a small number of CpG sites that must be interpreted with caution if they are associated with a trait of interest because they showed signi cant discrepancies between the two platforms.

Conclusions
Our results show that MC-seq is an e cient and reliable platform for methylome pro ling with a broader coverage of the methylome than the array-based platform. Although methylation measurements in majority of CpGs are highly correlated, a number of CpG sites show large discrepancy between the two platforms, which warrants further investigation and needs cautious interpretation.