Peak calling based on fragment block aggregation
To address the problem of oversensitivity in CUT&RUN peak calling, we developed Sparse Enrichment Analysis for CUT&RUN (SEACR), a peak calling algorithm that enforces precision from sparse data by quantifying the global distribution of background signal and using it to set a stringent empirical threshold for peak identity. CUT&RUN data from target antibody and IgG control experiments are first parsed into signal blocks representing segments of continuous, nonzero read depth by fragment spanning read pairs, and the signal in each block is calculated by summing read counts (Fig. 1a). A plot of the proportion of signal blocks in target versus IgG (y-axis) is used to identify the threshold value at which the percentage of target versus IgG blocks is maximized; then target blocks failing to meet the threshold are filtered out, leaving enriched peaks (Fig. 1a). We also filtered out blocks that overlap an IgG block meeting the threshold as a means to eliminate spurious peaks that arise either through multiple mapping at repeated regions or at false-positive sites [12].
We evaluated the effectiveness of using signal blocks as a metric for discrimination between potential true- and false-positive peaks in comparison with common strategies employed by ChIP-seq peak callers. For CUT&RUN datasets profiling K562 cells for H3K4me2, H3K4me3, H3K27me3, and CTCF at several different read subsampling levels, we used SEACR’s block aggregation utility to sort all signal blocks in the genome by total signal, and in parallel called peaks using MACS2 [13] or HOMER [14] with maximally relaxed peak cutoffs as detailed in the "Methods" section. We then used comparisons with validated ChIP-seq peak calls from ENCODE to plot precision–recall curves for each subsampled dataset for each of the three cutoff strategies (total signal in signal block for SEACR, -log10(FDR) for MACS2, and “Peak Score” for HOMER) and calculated areas under the curve (AUPR). SEACR AUPR was competitive with MACS2 and HOMER for all datasets interrogated and outperformed both of them across all read subsampling levels for CTCF (Fig. 1b, Additional file 1: Fig. S1A–C). Notably, though running MACS2 without a local lambda parameter to imitate the non-local peak identification of SEACR improved performance of MACS2 peak calling for H3K27me3 data, it made a negligible difference in performance for H3K4me2, H3K4me3 and CTCF. These results confirm that total signal in signal blocks is a valid metric for discriminating enriched regions from CUT&RUN data.
SEACR thresholding validates “gold standard” CUT&RUN data
Peak calling programs often feature several options for parameter selection, which afford flexibility to adapt to specific use cases, but also complicates analysis for users. Thus, we introduced only two major options for SEACR. First, we offer the option of either providing a control IgG dataset or choosing a global threshold value (default = IgG). The provision of a control IgG dataset is recommended for general use and is used in the remainder of the manuscript. Second, we implemented “stringent” and “relaxed” peak selection modes (default = stringent), which correspond to the threshold at which the maximum percentage of target versus IgG signal blocks are retained, or a threshold halfway between the maximum and the “knee” of the target percentage curve, respectively (Fig. 2a).
To test how well SEACR can avoid false-positive peaks in a CUT&RUN dataset in which such distinctions are known, we used SEACR, MACS2, and HOMER to call peaks from CUT&RUN data for two transcription factors (TFs), Sox2 and FoxA2 in human embryonic stem cells (hESCs) and definitive endoderm (DE) cells. Sox2 expression is restricted to hESCs, and FoxA2 expression is restricted to DE cells [15]. All three methods called a comparable number of peaks for Sox2 in hESCs or FoxA2 in DE cells. In contrast, SEACR called only 1–2 peaks for each factor when it is not expressed (Fig. 2a, “stringent”). HOMER and MACS2 called up to ~ 900 spurious peaks in these datasets using default peak calling thresholds for each program; these trends held when analyzing total bases covered by peaks or percentage of reads in peaks (Fig. 2—Additional file 2: Fig. S2A, B, “stringent”). Notably, this high selectivity of SEACR was maintained when running in “relaxed” mode (Fig. 2b, Additional file 2: Fig. S2A, B, “relaxed”). These results indicate that SEACR outperforms popular ChIP-seq peak callers in avoiding false-positive peak calls. Indeed, the combination of CUT&RUN data with SEACR peak calling results in nearly complete exclusion of false positives, which validates the trustworthiness of the CUT&RUN method.
SEACR default thresholds are robust over a wide range of read depths
To test the performance of SEACR default thresholds in comparison with thresholds set by ChIP-seq peak callers at low read depths, we called peaks from H3K4me2 CUT&RUN data subsampled 10 times each at 12 different read depths spanning from 2 million to 45 million reads. We used default thresholds for SEACR with both stringent and relaxed modes, MACS2 in standard or local lambda-inactivated “narrow peak” mode, and HOMER in “factor” mode. We then compared peaks with peaks called by the ENCODE consortium using MACS2 on ENCODE ChIP-seq data, assigning ENCODE peaks meeting a −log10 FDR threshold of greater than 10 as a stringent “truth set”. In both stringent and relaxed modes, SEACR consistently maximized the fraction of called peaks that were inside the test set (precision), with all tests in stringent mode and most in relaxed mode surpassing 85% precision (Fig. 3a). In contrast, all tests using MACS2 or HOMER with default cutoffs failed to reach the same precision. This indicates that the precision of SEACR is robust across a range of read depths.
To analyze the general optimality of combined recall and precision for each peak caller, we calculated the F1 score for each peak caller at each read subsampling level, such that larger F1 scores corresponded with higher performance in a combination of the two metrics. SEACR relaxed mode exhibited superior performance at all subsampling levels above ~ 7.5 million reads (Fig. 3b, blue curve). To account for the fact that peak callers such as MACS2 have parameters that can be optimized to adjust the desired precision–recall balance, we selected a stringent set of peaks from the MACS2 peak calls that meet a −log10(FDR) threshold of greater than 10, and recalculated F1 scores in comparison with SEACR. Although the more stringent MACS2 peak calls had improved performance above 10 million fragments, performance suffered at fragment subsampling levels below 10 million reads, rendering SEACR superior at those levels (Fig. 3b, magenta curve). Therefore, SEACR thresholds remain competitive with widely used ChIP-seq peak callers across multiple parameter selection strategies, even in the absence of arbitrary user input for the purposes of optimization. Although our conclusions are based on the presumption that high-scoring ENCODE peaks are true positives, the fact that they were originally called using MACS2 leads us to expect that the superior performance of SEACR on CUT&RUN data will generalize to any set of true positives. Thus, SEACR is an accurate peak caller for CUT&RUN data across a range of read depths and maintains a high percentage of true positive peak calls at low read depth.
SEACR retains broad domain structures
Peak calling is often confounded by the diverse distributions of chromatin proteins and modifications; for instance, transcription factors are expected to cluster at narrow genomic loci in peaks, whereas many histone modifications such as H3K27me3 cover broad regions that are not easily summarized by methods that detect peaks. Since our signal block approach is agnostic to region width, we reasoned that SEACR might be equally successful at identifying broad domains as it is for the peaks identified from Sox2, FoxA2, and H3K4me2 data. To test performance on broad domains, we called peaks using SEACR, MACS2, and HOMER (using “stringent”, “broad”, and “histone” settings, respectively) from an H3K27me3 CUT&RUN dataset using K562 cells [16] that contains broad domains by visual inspection. Remarkably, though SEACR called many fewer enriched regions than MACS2 or HOMER (28803, 97247, and 104524, respectively), SEACR regions covered more sequence (31.4 Mb) than either (28.1 Mb and 18.3 Mb), indicating that SEACR regions are broader. Indeed, the average width of SEACR regions exceeded that of MACS2 and HOMER by nearly an order of magnitude (Fig. 4a). Visual inspection of loci with broad H3K27me3 domains such as the HOXD cluster indicates that whereas MACS2 and HOMER partition the domain into several subregions, SEACR maintains the majority of the domain structure in a limited number of large signal blocks (Fig. 4b). These data indicate that SEACR is a useful tool for identifying large domains in CUT&RUN data in addition to spatially restricted binding sites.
SEACR exhibits competitive run time and read–write memory allocation
To evaluate SEACR’s run time and memory usage in comparison with other peak callers, we measured run time, average disk read memory, and average disk write memory for instances of H3K27me3 peak calling described above. SEACR was superior or competitive with all peak callers tested in all three metrics (Fig. 5a–c). Notably, for every read subsampling level tested, SEACR finished in less than 5 min on average, demonstrating that SEACR is scalable for datasets of vastly different sizes (Fig. 5a). We conclude that SEACR is a fast and efficient method for peak calling from a range of CUT&RUN datasets.
A web server for rapid desktop analysis using SEACR
In the interest of disseminating SEACR as widely as possible to users with limited expertise in bioinformatics, we developed a SEACR web server (http://seacr.fredhutch.org) for plug-and-play analysis of CUT&RUN data (Fig. 6a, b). The web server accepts bedgraph files of up to 500 Mb as inputs, enables users to toggle desired parameters (normalized or non-normalized mode, and stringent or relaxed mode), reports progress during run time, and provides a link to a downloadable BED file containing the results. We anticipate that the SEACR web server will make rapid CUT&RUN analysis accessible for the broader community.