An automated platform for genome-wide profiling of chromatin proteins
To adapt CUT&RUN to an automated format we equipped a Beckman Biomek FX liquid handling robot for magnetic separation and temperature control (Fig. 1a). First, cells are bound to concanavalin A-coated magnetic beads, allowing all subsequent washes to be performed by magnetic separation. Bead-bound samples are then incubated with antibodies, and up to 96 samples are arrayed in a plate (Fig. 1a). Successive washes, tethering of a proteinA-MNase fusion protein, cleavage of DNA, and release of cleaved chromatin fragments into the sample supernatant are performed on the Biomek (Additional file 1: Fig. S1a). A major stumbling block to automating epigenomics protocols is that they typically require purification of small amounts of DNA prior to library preparation. To overcome this hurdle, we developed a method to polish the DNA ends in chromatin fragments for direct ligation of Illumina library adapters (Additional file 1: Fig. S1a). End-polishing and adapter ligation are performed on a separate thermocycler, and deproteinated CUT&RUN libraries are purified on the Biomek using Ampure XP magnetic beads both before and after PCR enrichment. This AutoCUT&RUN protocol allows a single operator to generate up to 96 libraries in 2 days that are ready to be pooled and sequenced (Fig. 1a) (https://www.protocols.io/view/autocut-run-genome-wide-profiling-of-chromatin-pro-ufeetje).
To test the consistency of AutoCUT&RUN, we simultaneously profiled two biological replicates of H1 hESCs and K562 cells using antibodies targeting four histone modifications that mark active chromatin states (H3K4me1, H3K4me2, H3K4me3, and H3K27ac) and one repressive modification (H3K27me3). Comparing the global distribution of reads for each histone mark, we found that samples highly correlate with their biological replicate and cluster together in an unbiased hierarchical matrix (Fig. 1b). Additionally, the genome-wide profiles of the four active histone marks clustered together within a given cell type and separated away from the repressive histone mark H3K27me3 (Fig. 1b). These profiles represent antibody-specific signals, as all five are poorly correlated with an IgG-negative control. Together, these results indicate that AutoCUT&RUN chromatin profiling reproducibly captures the cell-type-specific distributions of histone marks.
In addition to profiling histone modifications, we also examined whether AutoCUT&RUN can be applied to mapping DNA-binding transcription factors. We tested the performance of AutoCUT&RUN with two transcription factors, the histone locus-specific gene regulator NPAT, and the insulator protein CTCF [21, 22]. AutoCUT&RUN profiles of both NPAT and CTCF are highly specific for their expected targets in both H1 and K562 cells (Additional file 1: Fig. S1b, c). Thus, AutoCUT&RUN is suitable for high-throughput, genome-wide profiling of diverse DNA-binding proteins.
Comparison of AutoCUT&RUN to ChIP-seq
We previously showed that the low backgrounds and high efficiency of CUT&RUN allowed for much lower DNA sequencing read depths than required for conventional ChIP-seq to obtain good feature definition. To determine whether improved performance relative to ChIP-seq extends to AutoCUT&RUN, we first identified the histone modification datasets from the ENCODE project that used the same antibodies and manufacturer catalog numbers as we used. A representative region is shown for direct comparison of tracks between AutoCUT&RUN and ENCODE (Fig. 2a). In all comparisons, the ENCODE datasets are seen to be much noisier than the CUT&RUN datasets, despite the fact that there were ~ 2 to 3 times as many mapped reads in the ENCODE datasets, pooling all of the reads from the two replicates available in the Gene Expression Omnibus (GEO). The requirement for much deeper sequencing using ChIP-seq relative to CUT&RUN is illustrated by the effect of downsampling the ENCODE datasets to a number of reads equivalent to the number of CUT&RUN fragments, where in the case of H3K4me1, the feature definition from ChIP-seq becomes dramatically reduced, whereas the same number of mapped CUT&RUN fragments shows clear peaks with much lower background than seen for either the Broad Institute or SYDH ENCODE tracks for all comparisons. We confirmed that the higher data quality of CUT&RUN extends to genome-wide analysis, where heat maps of MACS2 peak calls for CUT&RUN show much better signal-to-noise than heat maps for corresponding ENCODE datasets using ENCODE-generated peak calls (Fig. 2b).
The fixation, sonication, and immunoprecipitation steps of ChIP-seq have the potential to introduce significant batch-effect variability between experiments [23, 24], often making it difficult to directly compare large ChIP-seq datasets generated by different laboratories. To examine whether AutoCUT&RUN reduces this batch-effect variability, we compared the global distribution of reads for H3K4me1 in K562 cells generated from multiple different AutoCUT&RUN and ENCODE ChIP-seq experiments. This analysis revealed that biological replicates profiled by AutoCUT&RUN in different batches have a similar correlation with biological replicates profiled in parallel, indicating there is very little batch-effect variability between AutoCUT&RUN experiments (Fig. 2c). Furthermore, the correlation between AutoCUT&RUN samples and the Broad Institute ChIP-seq samples was similar to the correlation of Broad Institute ChIP-seq replicates with each other (Fig. 2c). In agreement with the visual comparison of tracks (Fig. 2a), this demonstrates the genome-wide profiles of H3K4me1 generated by AutoCUT&RUN are consistent with results obtained using ChIP-seq. However, the correlation between the Broad Institute ChIP-seq replicates and the SYDH ChIP-seq replicate was far lower than that observed between different AutoCUT&RUN batches, reaffirming the inherent difficulty in reproducing ChIP-seq results (Fig. 2c). We conclude that by eliminating many of the potential sources of batch-effects associated with ChIP-seq, AutoCUT&RUN significantly improves reproducibility between experiments, which will facilitate the adaptation for clinical applications.
We next examined whether AutoCUT&RUN profiles recapitulate global chromatin features that have been previously ascribed to hESCs using ChIP-seq. To maintain their developmental plasticity, hESCs are thought to have a generally “open,” hyper-acetylated chromatin landscape interspersed with repressed domains of “bivalent” chromatin, marked by overlapping H3K27me3 and H3K4 methylation [25,26,27,28]. AutoCUT&RUN recapitulates these features of hESCs; we observed that H1 cells have increased H3K27ac as compared to the lineage-restricted K562 cell line, whereas domains of the repressive histone mark H3K27me3 are rare in H1 cells, but prevalent in K562 cells (Fig. 3a). We also observed extensive overlap between H3K27me3 and H3K4me2 signals in H1 cells, but not K562 cells (Fig. 3a, b). Thus, AutoCUT&RUN profiles are consistent with the specialized chromatin features found in hESCs using ChIP-seq.
Post-translational modifications to the H3 histone tail closely correlate with transcriptional activity [29]. To determine whether our AutoCUT&RUN profiles of histone modifications are indicative of transcriptional activity, we examined the distribution of the five histone marks around the transcriptional start sites (TSSs) of genes, rank-ordered according to RNA-seq expression data (Fig. 3c, d) [30]. We find the active mark H3K4me3 is the most highly correlated with expression in both cell types (r = 0.70 and 0.81 for H1 and K562, respectively), followed by H3K4me2 and H3K27ac (Fig. 3c, d). The repressive histone mark H3K27me3 is anti-correlated with expression (r = −0.16 and −0.53 in H1 and K562, respectively) (Fig. 3c, d). We conclude AutoCUT&RUN for these histone marks provides a strategy to identify cell-type-specific gene regulatory programs.
Modeling cell-type-specific gene expression from AutoCUT&RUN profiles
To use AutoCUT&RUN data to compare cell types and distinguish their gene regulatory programs, we wanted to develop a continuous metric that incorporates both active and repressive chromatin marks. RNA-seq has been widely used to identify cell-type-specific gene expression programs [30], so we used RNA-seq data as a reference for training a weighted linear regression model that incorporates normalized H3K4me2, H3K27ac, and H3K27me3 read counts to assign promoters a relative activity score. We initially focused our analysis on genes with a single TSS that could be unambiguously assigned RNA-seq values. H3K4me2 was selected over H3K4me3 and H3K4me1 because H3K4me2 is uniquely applicable for modeling the activity of both proximal and distal cis-regulatory elements (see below). When applied to K562 cells, promoter chromatin scores correlate very well with RNA-seq values (r = 0.83) (Fig. 4a), providing a comparable power for predicting gene expression as similar models that used up to 39 histone modifications mapped by ChIP-seq (r = 0.81) [29]. In addition, our weighted model trained on K562 cells performs well when applied to H1 cells (Additional file 1: Fig. S2a, b), indicating that the linear model and data quality are sufficiently robust to assign promoter scores to diverse cell types.
Next, we examined whether AutoCUT&RUN accurately identifies promoters with cell-type-specific activity. By calling promoter scores that were enriched more than twofold in either H1 or K562 cells, we identified 2168 cell-type-specific genes and approximately 40% of these genes (865) were also differentially enriched between H1 and K562 cells according to RNA-seq (Fig. 4b–d). However, promoter activity modeling did not capture transcriptional differences for 1149 genes (Fig. 4d, Additional file 1: Fig. S2c, d), implying that these genes are differentially expressed without changes in the chromatin features included in our model. This differential sensitivity between methods suggests the three histone marks included in our chromatin model may more accurately predict the cell-type-specific expression of certain classes of genes than others. Indeed, we find the 865 cell-type-specific genes identified by both promoter activity modeling and RNA-seq are highly enriched for developmental regulators, whereas the genes called by either promoter scores or RNA-seq alone are not nearly as enriched for developmental GO terms (Fig. 4d, Additional file 1: Fig. S2e–g, Additional file 2: Table S1). In addition, only 35 genes display contradictory cell-type specificities according to promoter chromatin scores and RNA-seq (Fig. 4d). This demonstrates AutoCUT&RUN profiling of these widely studied modifications to the H3 histone tail can be applied to accurately distinguish between cell-type-specific developmental regulators.
To determine whether AutoCUT&RUN data recapitulate the expression of cell-type-specific transcription factors, we expanded our analysis to include all promoters. We find that components of the hESC pluripotency network (NANOG, SOX2, SALL4, and OTX2) have higher promoter chromatin scores in H1 cells, while regulators of hematopoietic progenitor cell fate (PU.1, TAL1, GATA1, and GATA2) are enriched in K562 cells (Fig. 4e, Additional file 2: Table S1) [31, 32]. This method also identifies activities of alternative promoters (e.g., at the OTX2 and TAL1 genes), providing an indication of the specific gene isoforms that are expressed in a given cell type (Fig. 4e). We conclude that AutoCUT&RUN can distinguish between master regulators of cellular identity, providing a powerful tool to characterize cell-types in a high-throughput format.
Profiling tumors by AutoCUT&RUN
Typical clinical samples often contain small amounts of material and have been flash-frozen, and although ChIP-seq has been applied to flash-frozen tissue samples, available methods are not sufficiently robust for diagnostic application. In addition, translational samples from xenografts, which are increasingly being used in clinical settings to probe treatment strategies for patients with high-risk malignancies [34]. These specimens can be extremely challenging to profile by ChIP-seq as they often contain a significant proportion of mouse tissue and so require extremely deep sequencing to distinguish signal from noise. To test whether AutoCUT&RUN is suitable for profiling frozen tumor specimens, we obtained two diffuse midline glioma (DMG) patient-derived cell lines (VUMC-10 and SU-DIPG-XIII) that were autopsied from similar regions of the brainstem, but differ in their oncogenic backgrounds [33]. SU-DIPG-XIII is derived from a tumor containing an H3.3K27M “oncohistone” mutation, which results in pathologically low levels of PRC2 activity, and because of this has been called an “epigenetic” malignancy. In contrast, VUMC-10 is a MYCN-amplified, histone wild-type brainstem glioma [34]. Both of these DMG cell lines readily form xenografts in murine models, and we applied AutoCUT&RUN to profile histone modifications in VUMC-10 and SU-DIPG-XIII xenografts that were seeded in the brains of mice and then resected upon tumor formation and frozen under typical clinical conditions (Fig. 5a). For comparison, on the same AutoCUT&RUN plate we profiled the parental DMG cell lines grown in culture (Fig. 5a). Again, we found that replicates were highly concordant, so we combined them for further analysis. Importantly, cell culture samples were highly correlated with the same mark profiled in the corresponding frozen xenografts, and AutoCUT&RUN on xenograft tissues and cell culture samples produced similar data quality (Fig. 5b, Additional file 1: Fig. S3). Thus, AutoCUT&RUN reliably generates genome-wide chromatin profiles from frozen tissue samples.
Stratification of patient malignancies is becoming increasingly dependent on molecular diagnostic methods that distinguish tumor subtypes derived from the same tissues. Our VUMC-10 and SU-DIPG-XIII samples provide an excellent opportunity to explore the potential of using AutoCUT&RUN to classify tumor specimens according to their subtype-specific regulatory elements. By applying promoter modeling to these samples, we identified 5006 promoters that show differential activity between VUMC-10 and SU-DIPG-XIII cells (Fig. 6a, Additional file 2: Table S1). Consistent with the glial origins of these tumors, both the VUMC-10- and SU-DIPG-XIII-specific promoters are significantly enriched for genes involved in nervous system development (Additional file 1: Fig. S4a, b). Genes involved in cell signaling are also overrepresented in SU-DIPG-XIII cells (Additional file 1: Fig. S4b); for example, the promoters of the PDGFR gene as well as its ligand PDGF are highly active in SU-DIPG-XIII cells (Fig. 6a). This is consistent with the observation that DMGs frequently contain activating mutations in PDGFR-α that promote tumor growth [5]. In addition, one promoter of the SMAD3 gene, a component of the TGF-β signaling pathway [35], is specifically active in SU-DIPG-XIII cells, whereas two different SMAD3 promoters are active in VUMC-10 cells (Fig. 6a, Additional file 1: Fig. S3). In comparison, our model indicates that only 388 promoters differ between VUMC-10 xenografts and cultured cells, and 1619 promoters differ between SU-DIPG-XIII samples (Fig. 6b, Additional file 1: Fig. S5c). In addition, comparing promoter chromatin scores in an unbiased correlation matrix also indicates DMG xenografts are far more similar to their corresponding cell culture samples than they are to other DMG subtypes or to H1 or K562 cells (Fig. 6c). This suggests that AutoCUT&RUN can be applied to identify promoters that display tumor subtype-specific activity, providing a reliable method to assign cellular identities to frozen tumor samples, as well as an indication of the signaling pathways that may be driving tumor growth and potential susceptibility to therapeutic agents.
High-throughput mapping of cell-type-specific enhancers
The cell-type-specific activities of gene promoters are often established by incorporating signals from distal cis-regulatory elements, such as enhancers [1, 2]. Similar to promoters, enhancers also display H3K4me2 [36], and active enhancers are typically marked by H3K27ac, whereas repressed enhancers are marked by H3K27me3 [28, 37, 38]. Therefore, we reasoned that the AutoCUT&RUN profiles we used to model promoter activity should also allow identification of cell-type-specific enhancers. To investigate this possibility, we first compared our H1 data to available chromatin accessibility maps generated by ATAC-seq, which are enriched for both active promoters and enhancers [39, 40]. Of the marks we profiled, we find H3K4me2 peaks show the highest overlap with ATAC-seq (Fig. 7a, Additional file 1: Fig. S5a), and identify 36,725/52,270 ATAC-seq peaks (~ 70%). Interestingly, H3K4me2 defines an additional 71,397 peaks that were not called by ATAC-seq (Fig. 7a, Additional file 1: Fig. S5a). Many of these H3K4me2-specific peaks show a low, but detectable ATAC-seq signal (Additional file 1: Fig. S5b), indicating they may correspond to repressed promoters and enhancers. Consistent with this interpretation, on average H3K4me2+/ATAC-TSSs have higher H3K27me3 signals than H3K4me2+/ATAC+ TSSs (Additional file 1: Fig. S5c). H3K4me2+/ATAC+ peaks that overlap with annotated TSSs are enriched for H3K4me3, while those peaks that do not overlap TSSs are enriched for H3K4me1 (Fig. 7b, c, Additional file 1: Fig. S5d), suggesting that many of these distal peaks are enhancers [28, 41]. Thus, mapping sites of H3K4me2 by AutoCUT&RUN provides a sensitive method for defining the repertoire of active cis-regulatory elements that control gene expression programs.
Finally, we examined whether AutoCUT&RUN can be used to identify cell-type-specific enhancers. To expand the number of putative enhancer sites, we compiled a list of non-TSS peaks called on H3K4me2 profiles from all six cell lines and xenograft samples. Using our linear regression model, we then assigned these elements chromatin scores and examined their correlations between different cell types. We find that the chromatin scores of DMG cell culture samples and xenografts are highly correlated (r = 0.75 and 0.87 for SU-DIPG-XIII and VUMC-10 samples, respectively) (Fig. 7d). In contrast, the chromatin scores of SU-DIPG-XIII cells show a weak positive correlation with VUMC-10 cells (e.g. r = 0.19), indicating tumor subtype-specific differences. For example, different enhancers near the SOX2 pluripotency gene are active in VUMC-10 cells than SU-DIPG-XIII or H1 cells (Fig. 7e), indicating that SU-DIPG-XIII cells resemble a more primitive neural stem cell type than VUMC-10 cells, as has been previously suggested [42]. Thus, modeling enhancer activity from AutoCUT&RUN profiles of chromatin marks is a highly discriminative method for stratifying cell types and tissue samples to inform patient diagnosis.