CFP1 primarily binds at active CGI TSSs and is associated with transcription
We previously reported an association of CFP1 binding with transcription at the α-globin locus, using a mouse model in which the mouse α-globin locus was replaced by the human locus [29, 31]. In this “humanised” mouse model, the epigenetic regulation of the human locus in mouse erythroblasts mirrors that observed in human primary erythroblasts, including recruitment of PcG when the gene is silenced [29]. In these humanised mouse erythroid cells, the targeted deletion of a key human α-globin enhancer (MCS-R2, also known as HS-40) results in a strong reduction in α-globin transcription [31, 32]. This is associated with the maintenance of PcG binding [29] and an impairment of CFP1 recruitment to the human α-globin promoter [29]. Similarly, in human primary erythroid and lymphoid (EBV-transformed) cell types, where this locus is active and inactive, respectively, CFP1 is recruited to the CGI spanning the α-globin gene only in erythroid cells [29]. Experiments with a robust, non-commercial CFP1 antibody [30] confirmed these results in primary cells (Additional file 1: Fig. S1A) and in the humanised mouse model (Additional file 1: Fig. S1B). ChIP signals derived from this antibody were several fold stronger than a comparable commercial CFP1 antibody (Additional file 1: Fig. S1C). The non-commercial antibody was tenfold more sensitive (14,872 peaks, as described below, versus 1434 in erythroid cells), and in 1102 locations where the peaks overlapped it showed a median 2.3 fold higher enrichment of reads in peaks given a similar number of ChIP-seq reads and the same input data. Given this tenfold greater sensitivity and 2.3-fold greater specificity, it was the primary antibody used in this study.
To identify genomic regions recruiting CFP1, we performed ChIP-seq in primary human lymphoid and erythroid cell types (biological replicates from Fig. S1A, Additional file 1). As shown in Fig. 1a, heatmaps of genomic TSSs and CGIs showed that these regions were strongly enriched in CFP1 binding. As shown in Fig. 1b, genomic CFP1 localisation was remarkably similar in both cell types, with a large number of peaks located at TSSs with annotated CGIs: 62.2% (9251/14,872) and 42.5% (9055/21,301) of genome-wide CFP1 summits were associated with CGI promoters in erythroid and lymphoid cells, respectively. Amongst non-CGI genomic regions of CFP1 binding, TSSs were less prominent (Fig. 1b). As shown in Fig. S1D (Additional file 1), CGI and non-CGI TSSs differed strongly in CpG content within a 1-kb window centred on TSSs. CpG content of representative CGI and non-CGI loci considered hereafter are graphed in Fig. S1D (Additional file 1) .
As shown in Fig. 1a, CFP1-bound TSSs were depleted in H3K27me3 and H3K4me1 histone marks, whilst being strongly enriched in H3K4me3. The association between CFP1 and H3K4me3 could either reflect deposition of H3K4me3 by CFP1-SET1A/B complexes, or binding to pre-existing H3K4me3, deposited by other TrxG complexes [33], through the PHD finger domain of CFP1 [17]. Accordingly, we found strong association between peaks of CFP1 and H3K4me3 in both erythroid and lymphoid cell types, both qualitatively at the level of heatmaps (Fig. 1a) and quantitatively in peak overlaps (Fig. 1c). Indeed, the fraction of CFP1 peaks overlapped with H3K4me3 was nearly 100% when the alternate peak finder, MACS2, was used (Additional file 1: Fig. S2).
In contrast to H3K4me3 and CFP1, little if any colocalisation was observed between CFP1 and H3K27me3. To further characterise depletion of H3K27me3 (a histone mark associated with PcG binding) in CFP1-bound regions, we plotted averaged read depth in a 2-kb window surrounding TSSs for CFP1 versus H3K27me3. As shown in Fig. 1d, TSSs showed mutual exclusivity between H3K27me3 and CFP1 occupancy, with marked preference for binding of either CFP1 or PRC2 (H3K27me3), or neither, but not both; this was observed consistently in erythroid and lymphoid cell types. This result demonstrates that mutual exclusivity of CFP1 and the H3K27me3 mark at TSSs, which was first described in the α-globin locus [29], is a general genome-wide mechanism in disparate cell types.
Binding of CFP1 to housekeeping and tissue-specific gene TSSs
We next assessed CFP1 occupancy at TSSs by gene class, specifically housekeeping genes and developmentally regulated/tissue-specific genes (Fig. 2 and Additional file 1: Fig. S3). Both the α-globin locus in erythroblasts (Fig. 2a, left) and IRF4 locus in lymphoid cells (Fig. 2a, right) exhibit lineage-specific binding of CFP1. Housekeeping genes such as ACTB and LUC7L (located just downstream of the α-globin locus) showed binding in both cell types (Fig. 2b). The CGI promoter of RHBDF1, which is not expressed in either cell type, was marked by H3K27me3 in both erythroid and lymphoid cell types (Additional file 1: Fig. S3C), but was not bound by CFP1 (Fig. 2c).
We then extended our analysis to genome-wide sets of housekeeping and developmentally regulated/tissue-specific genes. First, we used the Illumina Body Map RNA-seq data set of 16 human tissues to identify 5354 genes with similar expression across most or all of these tissues and termed these housekeeping genes (Fig. 2d; see “Methods”). Another 7150 genes with low expression in nine or more of 16 tissues were termed candidate tissue-specific genes. RNA-seq data sets from erythroblasts and EBV-transformed lymphoid cells were then used to identify candidate tissue-specific genes with elevated expression in these cell types (412 genes in erythroid and 658 in lymphoid cells; see “Methods”). Interestingly, CXXC1 mRNA, which encodes CFP1 protein, was denominated a housekeeping gene, with mean RPKM (reads per kilobase of transcript per million mapped reads) of 5.8 (± 3.1 SD). Public expression data sets for EBV and ERY cells showed overexpression of CXXC1 in EBV (RPKM = 50.6) and reduced expression in ERY (RPKM = 0.62). This mRNA expression difference corresponded to approximately twofold higher expression at the protein level in lymphoid cells (Additional file 1: Fig. S4).
To address the confounding effect of alternate TSSs of transcribed genes, some with or without CGIs, and some transcribed or not, we next limited our analysis to putatively expressed TSSs. These were identified by accessible chromatin, which was defined by the presence of 1x-normalised/input-subtracted ATAC (Assay for Transposase-Accessible Chromatin) signal > 10 within 1 kb of the TSS. Most putatively expressed TSSs regardless of gene class were marked by CGIs (defined by a distance < 1 kb; green dotted lines, Fig. 2d). However, a larger proportion of housekeeping gene TSSs were marked by a CGI (94.7% in erythroid and 96.8% in lymphoid cells), compared to TSSs of tissue-specific genes (77.5% in erythroid and 82.5% in lymphoid cells). CGIs accounted for most of the CFP1 occupancy at TSSs, an observation recapitulated in both gene classes and both cell types, and CFP1 peaks were biased for CGI TSSs in all gene classes, as expected (OR ≥ 5.9, p ≤ 1.6 × 10−12, Additional file 1: Table S1). Notably, though, housekeeping genes showed increased frequency of CFP1 peaks compared to tissue-specific genes, and this was true in both cell types in both CGI and non-CGI TSSs (Additional file 1: Table S2). Taken together, these findings reinforce the observation that CFP1 occupancy marks active CpG-rich TSSs of both housekeeping and tissue-specific genes, but also reveal that CFP1 is preferentially associated with TSSs of broadly expressed genes whether or not they meet strict criteria for housekeeping genes (see “Methods”).
CFP1 binds to non-CGI regions associated with transcription
Remarkably, although we observed significant CFP1 bias for CGI TSSs, our data also revealed that CFP1 was bound to a proportion of accessible non-CGI TSSs: for example, 36/78 (46%) and 7/44 (16%) tissue-specific TSSs in ERY and EBV, respectively (Fig. 2d). More than half of non-CGI housekeeping TSSs were occupied by CFP1, consisting of 214/335 (64%) TSSs in ERY and 84/139 (60%) TSSs in EBV (Fig. 2d). In addition to expressed non-CGI TSSs, we found the vast majority of annotated, expressed CGI TSSs occupied by CFP1, including 233/269 (86%) tissue-specific and 5600/5948 (94%) housekeeping TSSs in ERY. Low CpG density, however, was not necessarily correlated with a lack of CFP1 binding, as exemplified by the well-characterised β-globin gene HBB. The HBB locus is CpG-poor (Additional file 1: Fig. S1D), but shows similar CFP1 ChIP signal intensity in erythroid cells to that observed at α-globin gene promoters (Fig. 2e). Experiments using the previously described CFP1 antibody from Abcam confirmed these results, albeit with weaker ChIP signal (Additional file 1: Fig. S5). These findings call into question the previously reported CGI specificity for this protein [14] and raise the possibility that the recruitment of CFP1 at non-CGI sites may occur by protein–protein interactions with pre-existing H3K4me3, due to the reader properties of the plant homeodomain (PHD) of CFP1 [17, 26, 33].
Coincident CFP1 and Pol II binding at accessible, expressed TSSs
We next examined more closely the relationship between CFP1, Pol II and expression level using public RNA-seq data sets. We limited our analysis to accessible TSSs (defined by overlap with a 1x-normalised, input-subtracted ATAC-seq signal > 10); this accessibility makes these strong candidate TSSs for the observed expression (Fig. 3a, b). Accessible TSSs of expressed genes in both cell types were occupied with CFP1, regardless of the presence of a CGI (Fig. 3c–f). A few accessible TSSs appeared non-expressed in the available public erythroid data set (GSE74246) while displaying some CFP1 binding in our independent data set (bottom of Fig. 3c, d). We considered it most likely that these are small artefacts due to variability between RNA-seq data sets. Putatively accessible, expressed TSSs were almost universally marked by Pol II (Fig. 3g–j); however, strength of Pol II ChIP signals was only poorly correlated with expression, in line with the observations of others [34]. Plotting Pol II versus CFP1 occupancy at accessible TSSs (Fig. 3k–n), we found that TSSs with increased Pol II intensity also exhibited generally increased CFP1 intensity. The correlation between Pol II and CFP1 intensity at these loci was weak in EBV cells (Pearson R = 0.28, p < 10−200, and R = 0.26, p = 1.4 × 10−12, in CGI and non-CGI TSSs, respectively); however, this correlation was visually identifiable in ERY (Pearson R = 0.61, p < 10−200, and R = 0.71, p < 10−200, in CGI and non-CGI TSSs, respectively). Within the limits of public data sets, these results demonstrate a weak positive relationship between Pol II and CFP1 at accessible, transcribed TSSs.
Occupancy of CFP1 at enhancers
Unmethylated CpG dinucleotides at TSSs are known to recruit CFP1 through its ZF-CxxC domain [14]. However, recent studies have demonstrated that, in addition, CFP1 can bind all three H3K4 methylation states with varying affinity via its PHD zinc finger domain [17, 26, 33]. Frequent localisation of H3K4me1 at enhancers could therefore predict the presence of CFP1 at enhancers.
To address localisation of CFP1 at enhancers, putative enhancers were defined by open chromatin (ATAC-seq) and the presence of Pol II in non-TSS regions (Fig. 4a, see “Methods”). Elevated DNase hypersensitivity, strong bimodal H3K4me1 signal and H3K27ac surrounding these loci confirmed that these sites are likely active regulatory regions. The presence of little or no H3K4me3 and separation from known TSSs rules out a role for these sites as alternative promoters. The elevated ATAC-seq signal in these regions was largely tissue specific, with 10,548 and 3956 putative enhancers identified in erythroid and lymphoid cells, respectively; only 480 overlaps were observed between these sets of putative enhancers. It should be noted that these putative enhancers represent only a stringent subset with an arbitrarily high ATAC-seq signal, and thus they represent the best-supported candidate enhancers based on the data sets available.
Elevated CFP1 signal was observed at putative enhancers in both cell types, and heatmap evidence suggested qualitative association between CFP1 and signatures of open chromatin (elevated ATAC-seq, DNase-seq, H3K27ac and H3K4me1 signals, Fig. 4a). Indeed, averaged in 2-kb windows around putative enhancers, putative enhancers with above-median ATAC-seq and DNase-seq signals showed elevated CFP1 signal (p < 10−200, t test, both cell types); this, however, was less intense than in accessible TSSs (Fig. 4b). Furthermore, the correlation of CFP1 signal with chromatin accessibility signals was poorer in enhancers than in TSSs for both cell types (Fig. 4a, c). Nonetheless, CFP1 signal intensity was high at putative enhancers associated with a CpG island. Furthermore, intergenic CFP1 peaks (Fig. 1b) showed strong, non-random overlap with our set of putative enhancers. In erythroid cells, 444/1673 CFP1 peaks were colocalised with putative enhancers, which cover 0.16% of the genome (166-fold overrepresented, p < 10−200, binomial test); in lymphoid cells a weaker but specific overrepresentation was noted, with 159/5488 intragenic CFP1 peaks colocalising with this stringently defined set of putative enhancers (18.1-fold overrepresented, p = 1.5 × 10−137). These findings demonstrate the presence of CFP1 at enhancers and demonstrate that its association with enhancers is specific.
CFP1 colocalises with members of the SET1 complexes
The chromatin-binding properties of CFP1, aided by a SET1 subunit, anchor SET1A/B complexes in accessible chromatin [12, 13, 17, 26, 35], and these SET1A/B—along with other TrxG—complexes are responsible for H3K4 methylation in accessible chromatin [13, 18,19,20, 27, 28, 36,37,38]. To investigate this, we performed additional ChIP-seq experiments in erythroid cells. We examined genomic occupancies of SET1A (SETD1A) and Host Cell Factor 1 (HCF1/HCFC1), subunits of the SET1A complexes (Fig. 5, Additional file 1: Fig. S6), and RBBP5, a core subunit of all TrxG complexes [11, 39].
To probe the composition of these complexes, we first asked to what extent observed peaks of CFP1, SET1A, HCF1 and RBBP5 were consistent with known composition of SET1A complexes. Upon examination of these ChIP peak overlaps within a high-confidence subset of peak regions (Fig. 5c, see “Methods”), three observations supported the concept that these proteins are subunits of the same complex. First, we observed strong colocalisation between SET1A and CFP1 such that 92.9% (4266/4594) of SET1A peaks were colocalised with 59.4% of the 7179 CFP1 peaks (red outline, Fig. 5c). This observation is consistent with a role for CFP1 as a subunit recruiting SET1A and other proteins to bind to DNA. Furthermore it is consistent with the known difference between SET1A and SET1B localisation [38]: excess CFP1 peaks likely are accounted for by SET1B complexes. Secondly, we noted that 90.9% (6528/7179) of CFP1 peaks were colocalised with RBBP5 (blue outline, Fig. 5c), which is known to form part of a catalytic ASH2L-RBBP5 heterodimer that activates methyltransferase activity in TrxG complexes [36]. This observation suggests a near-complete association of genome-bound CFP1 with RBBP5. Thirdly, we found that a large majority (85.0%, 2047/2408) of HCF1 peaks were colocalised with CFP1, SET1A and RBBP5 (bold font, Fig. 5c). This and the fact that HCF1 colocalisation accounted for a substantial fraction of these three other peak-sets is consistent with CFP1, HCF1 and RBBP5 being obligate members of SET1A complexes. Taken together, this analysis shows that in regions of high-confidence ChIP signals: (1) there is intimate and specific association between SET1A complex components; (2) genomic SET1A and HCF1 are largely or entirely colocalised with CFP1; and (3) the vast majority of genome-bound CFP1 is associated with RBBP5.
CFP1-SET1A complexes are specialised to TSSs relative to other TrxG complexes
Notably, the above analysis of SET1A complex subunits showed that RBBP5 had a small but substantial fraction of remaining peaks not colocalised with these subunits, suggestive of additional genomic roles for this protein. Therefore, in an expanded analysis, we considered colocalised SET1A and CFP1 (CFP1-SET1A) as representative of SET1A complexes, Menin as a representative of MLL1/2 complexes and UTX as a representative of MLL3/4 complexes (Additional file 1: Fig. S7). This analysis demonstrated that the excess of RBBP5 peak regions unaccounted for by CFP1-SET1A was accounted for by representatives of the MLL1/2 and MLL3/4 complexes. It also showed that genomic occupancy of HCF1 was almost completely accounted for by overlaps with CFP1-SET1A and Menin, which are representatives of SET1 and MLL1/2, respectively. These observations support the view that CFP1, SET1A, HCF1, Menin, UTX and RBBP5 represent a set of proteins whose subunit interactions define compositionally distinct complexes that nevertheless overlap in their genomic occupancy.
We next asked if genomic occupancy of CFP1-SET1A complexes was specialised relative to other TrxG complexes. Specifically, we hypothesised that CFP1-SET1A complexes could differ from other TrxG complexes in their localisation to genomic TSSs and putative enhancers defined earlier. Removing from consideration the small numbers of exclusive CFP1-SET1A (n = 9), Menin (n = 583) and UTX (n = 507) peak regions not colocalising with other TrxG complex subunits (Additional file 1: Fig. S7), we found evidence for substantial site specialisation by the various TrxG complexes (Fig. 5d). CFP-SET1A, Menin and UTX peaks were allocated similarly (approximately 5%) to non-CGI TSSs. In CGI TSSs, all were present at high but differing proportions: 86% of CFP1-SET1A peaks, representing SET1A complexes, were allocated to CGI TSSs, compared to 74% of Menin peaks, representing MLL1/2 complexes, and 51% of UTX peaks, representing MLL3/4 complexes. In our previously defined enhancers, we found the opposite, with putative enhancers representing 6% of CFP1-SET1A peaks, 9% of Menin (MLL1/2) peaks and 15% of UTX (MLL3/4) peaks. In other non-TSS sites of colocalised ChIP peaks, where our conservative criteria did not detect active chromatin, the pattern at enhancers was recapitulated, suggesting that many of these sites act as enhancers. Interestingly, 1409 loci (Additional file 1: Fig. S7) that were external to CFP1-SET1A peaks nevertheless harboured UTX, RBBP5 and Menin (a member of MLL1/2 complexes). Approximately half of these (n = 672) were at TSSs. MLL2 is known to deposit H3K4me3 to establish bivalent domains [27, 28, 40], and UTX is a demethylase known to remove repressive H3K27me3 marks [41]. Finding these proteins colocalised is consistent with physical interaction of UTX with MLL2 complex, a phenomenon postulated from co-immunoprecipitation experiments in differentiated cell lines [40]. Taken together, this analysis of regions of high-confidence ChIP signals shows: (1) TrxG subunits are bound to the genome mostly as multi-protein complexes, and it is mostly the presence of these complexes that gives rise to the observed ChIP signals; (2) although components of all TrxG complexes may be found at TSSs, CFP1-SET1A complexes are specialised to TSSs relative to other TrxG complexes; and (3) TrxG complex subunits whose functions include establishment of H3K4me3 at bivalent domains and clearing of the repressive H3K27me3 mark are recruited to sites that do not include CFP1.
CFP1 occupancy is associated with specific (epi)genomic features and sequence motifs
Given the specificity with which CFP1/SET1A complexes bind in the genome, we next asked what sequence-level determinants might account for this. Therefore, CFP1 peaks were analysed for motif enrichment using homer [42], and ten optimised motifs, five from CGI-associated CFP1 peaks and five from non-CGI peaks, were identified (Additional file 1: Table S3, see “Methods”). Four of five motifs derived from non-CGI peaks lacked a CpG dinucleotide, and four of these five motifs were also strongly biased or exclusive for either cell type, whereas the fifth was of low complexity (Additional file 1: Table S3); these results demonstrate that CFP1 is binding in a cell type-specific manner in non-CGI regions. On the other hand, all five motifs optimised in CGI-associated CFP1 peaks contained a CpG dinucleotide and were found at high frequency in CFP1 peak-sets in both EBV and ERY cell types (Additional file 1: Table S3). Genome-wide however, an average of only 4% of these five CGI-associated motifs (ranging from 2.9 to 6.4%) fell within CGIs, similar to the genomic fraction of CpGs (7.0%) falling within CGIs. This result emphasises the known importance of the CpG dinucleotide, which is bound by CFP1 through its ZF-CxxC domain [17].
To develop a global model of the features specifying CFP1 genomic occupancy, we then analysed the correlation of CFP1 ChIP signal with a number of genomic signals in addition to genomic CpG density in erythroid cells. We considered: (1) that CFP1 is also known to bind all methylation states of H3K4 through its PHD finger domain [26, 33]; (2) that SET1 proteins have been shown to enhance CFP1 interactions with DNA [17]; and (3) that HCF1 and RBBP5 are additional subunits of this complex [12, 36]. These ChIP intensities and CpG density were averaged in 2-kb windows centred on TSSs and putative enhancers (Figs. 1a, 4 and 5; n = 41,167 non-redundant loci). To test the sufficiency of these features to explain CFP1 binding, we computed the correlation of each feature with CFP1 individually (Additional file 1: Table S4). Across all 41,167 TSSs and enhancers (irrespective of chromatin state), all features except H3K4me1 were correlated with the CFP1 signal (Pearson R2 > 0.1, Additional file 1: Table S4). However, restricting our calculations to TSSs and enhancers having top 10% chromatin accessibility scores (4117 sites, Additional file 1: Fig. S8, see “Methods”), only three features were still highly correlated with the CFP1 signal: H3K4me3, SET1A and CpG density. Individually, these features could explain 28.6, 16.0 and 15.2% of the CFP1 signal, respectively, but incorporated in a linear regression model, these features accounted for 47.7% of CFP1 signal (Additional file 1: Table S5, Additional file 1: Fig. S9). This result shows that these three features contribute mostly additively to CFP1 binding in accessible chromatin regions.