Fallopian tube sample preparation
Fallopian tubes from two primary patients were delivered in saline from a local hospital. Upon arrival, samples were washed using sterile 1x Phosphate Buffered Saline and snap-frozen in liquid nitrogen. At a later date, the fallopian tubes were thawed and minced. DNA was extracted from the minced fallopian tubes using the tissue protocol from Qiagen’s DNeasy Blood & Tissue Kit (69504). Following extraction, dsDNA was quantified using Invitrogen’s Qubit 3.0 Fluorometer.
Whole-genome methylation sequencing libraries
Methylated pUC19 and unmethylated lambda phage DNA (0.0005% and 0.01%, respectively) were added to each high molecular weight genomic DNA sample. These are included as methylation controls in the NEB protocol and were added to the DNA used in each protocol for consistency. The DNA was then sheared to approximately 350 bp in average size for all prepared libraries, with the exception of the post-bisulfite adapter tagging (PBAT) libraries, where the DNA was not initially sheared.
Libraries were prepared from the KAPA Hyper Prep kit (v6.17) (Roche Sequencing, Cat. #KK8504) with an input of 300 ng of sheared DNA following the manufacturer’s protocol with the following modifications. Illumina TruSeq Nano adapters at a concentration of 10 \(\mu\)M were used. The post ligation cleanup elution was reduced to 20 \(\mu\)L and the entire DNA elution went into the EZ DNA Methylation-Gold kit (Zymo Research, Cat. #D5005). The bisulfite converted DNA was eluted in 20 \(\mu\)L and 10 cycles of library amplification were performed using the KAPA HiFi HotStart Uracil+ ReadyMix (Roche Sequencing, Cat. #KK2800).
Libraries were prepared from the Accel-NGS Methyl-Seq DNA library kit (v3.0) (Swift Biosciences, Cat. #30024) with an input of either 10 ng or 100 ng of sheared DNA following manufacturer’s protocol with the following modifications. The DNA was bisulfite converted using the EZ DNA Methylation-Gold kit (Zymo Research, Cat. #D5005) with an elution volume of 15 \(\mu\)L. Following adapter ligation, either 8 cycles (10 ng DNA input) or 4 cycles (100 ng DNA input) of library amplification were performed.
Libraries were prepared from the NEBNext Enzymatic Methyl-seq kit (New England Biolabs, Cat. #E7120L) using an input of either 10 ng or 200 ng of sheared DNA and libraries were made according to the manufacturer’s protocol. The denaturation method used was 0.1 N sodium hydroxide, according to the protocol, and either 8 cycles (10 ng DNA input) or 4 cycles (200 ng DNA input) of PCR amplification were performed.
Libraries were prepared from the PBAT method described by Miura and Ito [13] using an input of 100 ng of genomic DNA that went directly into the EZ DNA Methylation-Gold kit (Zymo Research, Cat. #D5005) with an elution volume of 21 \(\mu\)L. Then, 10 \(\mu\)L of the bisulfite converted DNA was used for making the PBAT libraries as previously described in [13] with the modification of using KAPA HiFi HotStart Uracil+ ReadyMix (Roche Sequencing, Cat. #KK2800) for the DNA template extension step.
KAPA pure beads (Roche Sequencing, Cat. #KK8001) were used for cleanup steps for all prepared libraries.
Quality and quantity of the finished libraries were assessed using a combination of the Agilent High Sensitivity DNA chip (Agilent Technologies, Inc., Cat. #5067-4626), QuantiFluor® dsDNA System (Promega Corp., Cat. #E2670), and Kapa Illumina Library Quantification qPCR assay (Roche Sequencing, Cat. #KK4824). 100 bp paired-end sequencing was performed on an Illumina NovaSeq6000 sequencer using an S4, 200 bp sequencing kit (Illumina Inc., San Diego, CA, USA), with 10% PhiX. Base calling was done by Illumina RTA3 and output of NCS was demultiplexed and converted to FASTQ format with Illumina Bcl2fastq (v1.9.0).
Alignment and methylation extraction
Upon receipt of the FASTQ files, the files were trimmed using Trim Galore [31] version 0.6.4_dev (using Cutadapt version 2.10). Default inputs were used, other than: –illumina –trim-n –paired –cores 4 –fastqc –fastqc_args "–noextract". In addition, the FASTQ files for the Swift samples included –clip_R2 14, due to the Adaptase™ method used by Swift Biosciences [32].
The trimmed FASTQ files were aligned to GRCh38 [33] using BISCUIT [34] version 0.3.16. An index for the reference genome was created using biscuit index GRCh38.p13.genome.fa, followed by aligning each sample to the indexed reference. The alignment step used the default options for biscuit align, with these exceptions: -M -t 20 -R sample_specific_read_group. Each sample received its own read group (-R tag). The aligned reads were duplicate marked using Samblaster [35] version 0.1.25, with the -M flag and defaults. The reads were then coordinate sorted and indexed using Samtools [36] version 1.10 (with htslib 1.10.2). Default options were used, with the following exceptions -@ 20 -m 5G -o sample_name.sorted.markdup.bam -O BAM (sort) and -@ 20 (index).
The extraction of cytosine methylation information proceeded as follows. Pileup VCF files were generated using biscuit pileup with default parameters. bgzip and tabix (included with htslib version 1.10.2) were used to compress and index the VCF files. Default parameters were used for bgzip and tabix, with the exception of -p vcf in the call for tabix. The VCF files were then processed through biscuit vcf2bed, bedtools sort (bedtools [37] version 2.29.2), and biscuit mergecg to create coordinate sorted BED files containing CpG methylation beta value information. For each command, the default parameters were used. After creating the BED files, they were compressed and indexed using bgzip and tabix, with default parameters being used in both cases, with tabix also including -p bed.
Any code not explicitly stated in this and subsequent sections can be found on GitHub [38].
FASTQ and alignment quality control
Quality-control data were collected from a number of sources and viewed using MultiQC [39] version 1.9. Quality control for the FASTQ files was generated by FastQC [40] version 0.11.9 and Cutadapt during the trimming process. Command arguments used can be found in the previous section. Statistics on the percentage of duplicate marked reads were produced by Samblaster. Library complexity was calculated by Preseq [41] version 2.0.3 using these options, c_curve -B -P -v -o sample.complex.ccurve.txt, where “sample” is the name of each sample. Quality controls from Samtools were generated with samtools stats and samtools flagstat, with default parameters and -@ 20. BISCUIT includes a custom BASH script to generate quality-control statistics related to data aligned by BISCUIT. This script was run with this command, QC.sh -v samp.pileup.vcf.gz -o samp_QC hg38_assets GRCh38.p13.genome.fa samp samp.sorted.markdup.bam. In each case, “samp” corresponds to the name of the processed sample.
The hg38_assets mentioned in the BISCUIT quality-control script command can be found in a zip file on the BISCUIT GitHub release page [42].
Library protocol comparison analysis
To collect statistics related to the raw reads stored in the FASTQ files, a custom Python (version 3.7.6) script was written. It uses the gzip, glob, and time Python base modules and these additional Python packages: argparse (version 1.1), numpy (version 1.18.1), and biopython (version 1.76). Statistics regarding the trimmed reads were collected from log files generated by Cutadapt, as described previously.
For a number of the analyses, the aligned BAM files were subsampled before calculating the corresponding metric. The BAMs were subsampled using samtools view -hbu -F 0x4 -q 40 sample.bam | samtools view -hbu -s FRAC -. “FRAC” was calculated as
$$\begin{aligned} \frac{150,000,000}{Number~Mapped~Fragments~with~MAPQ\ge 40}, \end{aligned}$$
(1)
for each sample. The subsampled BAMs were sorted and indexed using Samtools. Pileup VCF files and merged CpG BED files were generated in a similar manner to the original BAMs, as described previously.
Using the subsampled BAMs, the average coverage and percentage of covered CpGs within different genomic regions, including CpG islands, exons, genes, and repeat-masked regions, were calculated using custom scripts. The CpGs that fell in each region were determined by intersecting a BED file containing CpG coordinates and the coverage at those locations with a BED file containing the region’s coordinates. The coverage was determined using bedtools genomecov, while the intersection was done using bedtools intersect. The average coverage was calculated by taking a weighted average of the coverage for each CpG. The percentage of covered CpGs was calculated by
$$\begin{aligned} \frac{Number~of~CpGs~in~Region~with~Coverage>1}{Total~Number~of~CpGs~in~Region}. \end{aligned}$$
(2)
The scripts used to calculate these values made use of GNU parallel [43].
The observed coverage to expected coverage ratio was calculated as \(ratio = Obs / Exp\), where:
$$\begin{aligned} Obs&= \frac{Number~of~Bases~Mapped~to~Region}{Total~Number~of~Mapped~Bases} \end{aligned}$$
(3)
$$\begin{aligned} Exp&= \frac{Sum~of~Mappability~Scores~in~Region}{Total~Sum~of~Mappability~Scores~in~Genome}. \end{aligned}$$
(4)
This formulation assumes all bases are not equally accessible when sequencing. The expected coverage takes into account this difference in accessibility by including mappability scores based on the Bismap k100 multi-read mappability scores [44]. The observed coverage does not include these scores, as mappability is assumed to be inherently included when performing DNA sequencing. Because Bismap did not include a mappability score for every base in the genome, the expected and observed coverage calculations were restricted to those bases that included a mappability score. Only mapped reads (FLAG field in BAM does not include 0x4 flag) with MAPQ score \(\ge 40\) were included in this calculation. The values for each observed to expected ratio were calculated using custom BASH scripts that used bedtools and GNU awk (version 4.0.2).
The beta values for the lambda phage and pUC19 methylation controls were extracted using the same method as the genomic methylation extraction (see Alignment and Methylation Extraction for the details), with the one exception being that only one read was required to cover a CpG. The coverage requirement difference was due to the fractional amount of lambda phage and pUC19 that were included in the sequencing compared with the amount of genomic DNA. The mitochondrial DNA beta values were taken directly from the genomic methylation BED file.
To calculate correlations between samples, the genome was binned into 100 kb bins and the average beta value calculated for CpGs in each window. The bins were determined via bedtools makewindows -w 100000 -g GRCh38.p13.genome.fa.fai | sort -k1,1 -k2,2n. The “.fai” file was generated via samtools faidx GRCh38.p13.genome.fa. The average beta value for each bin was calculated via bedtools map -a bins.bed -b sample.bed -c 4,5 -o mean | gzip. bins.bed contains the 100 kb bins, while sample.bed is the merged CpG BED file generated from the subsampled BAMs. Correlations between the 100 kb averaged beta values were calculated using the Spearman correlation coefficient, as implemented in Python’s scipy package (version 1.4.1).
The PCA was performed using the average beta values in 100 kb bins, which were calculated in the same way as the correlation analysis. After calculating the beta values, they were transformed into smoothed M-values via:
$$\begin{aligned} logit \left( \frac{M+k}{(M+k) + (U+k)} \right) , \end{aligned}$$
(5)
where M is the number of methylated cytosines and U is the number of unmethylated cytosines at a given CpG. The smoothing factor, k, eliminates the infinities that occur at 0 and 1 in the logit transformation. The numpy logit function was used to apply the transformation. This conversion turns the beta-distributed beta values into the more Gaussian-distributed M-values. After converting to M-values, the data was standardized using the StandardScaler function, as implemented in scikit-learn (sklearn version 0.24.0). The PCA was performed using scikit-learn’s PCA function, keeping only the first two components (n_components=2).
The analysis for methylation bias in the NEBNext Enzymatic Methyl-seq kit was performed by extracting CpGs that had more than 20 reads covering them for both the NEB and Swift protocols for Sample A replicate 1 and 2 or the NEB and Swift protocols for Sample B replicate 1 and 2. Methylation bias was determined by requiring both Equations 6 and 7 to be true.
$$\begin{aligned} |\beta _{NEB} - \beta _{Swift}|&> 0.5~(Replicate~1) \end{aligned}$$
(6)
$$\begin{aligned} |\beta _{NEB} - \beta _{Swift}|&> 0.5~(Replicate~2) \end{aligned}$$
(7)
This was done separately for the standard DNA input NEB and Swift protocols of Samples A and B.
The CpG and other genomic region BED files mentioned in this section can be found on the GitHub release page for the analysis code.