Schematic overview of EPIGENE
EPIGENE uses a multivariate HMM, which allows the probabilistic modelling of the combinatorial presence and absence of multiple IHEC class 1 histone modifications. It receives a list of aligned ChIP and control reads for each histone modification, which is subsequently converted into presence or absence calls across the genome using normR (see “Binarization of ChIP-seq profiles” section; Fig. 1a (i)). By default, TU states were analysed at 200-bp non-overlapping intervals called bins. The HMM comprises 14 TU states and 3 background states where each TU state captures individual elements of a gene (i.e. TSS, exons, introns, and TTS). The TU state sequence was duplicated, running from TSS to TTS and from TTS to TSS, allowing identification of TUs on the forward and reverse strand, respectively (see Fig. 1a (ii)). The transition probabilities between the TU states were trained in a supervised manner using GENCODE annotations [41] and their emission probabilities were trained on a highly confident set of GENCODE transcripts [41] that showed an enrichment for RNA Polymerase II in K562 cell line (see “Training the model parameters” section). The transition and emission probabilities of background states, the transition probabilities from or to either the TSS or TTS state, and the transition probabilities between TSS and TTS states were trained in an unsupervised manner (see “Training the model parameters” section). The HMM outputs a vector where each bin is assigned to a TU state or to one of the three background states. This vector is then further refined to obtain active TUs (see Fig. 1b).
Validation with existing gene annotations and RNA-seq
We validated the predicted TUs using existing gene annotations and RNA-seq evidence. For this, we combined the EPIGENE predictions (24,571 TUs) and RNA-seq predictions that were obtained from Cufflinks (32,079 TUs) and StringTie (101,656 TUs; Additional file 1: Tables S2–S4 for summary statistics) to generate a consensus TU set. This consensus TU set contains 24,874 TUs, which were then overlaid with GENCODE and CHESS gene annotation [41, 42] (Fig. 2). We found that 93% of EPIGENE TUs can be explained by existing gene annotations. We identified 14,797 (11,584: annotated, 3213: unannotated) RNA-seq-exclusive TUs and 1304 (718: annotated, 586: unannotated) EPIGENE-exclusive TUs. Additional integration of RNA Polymerase II ChIP and Nascent RNA-seq data revealed that 40% (232 out of 586 TUs) of EPIGENE unannotated TUs and 35% (1120 out of 3213 TUs) of RNA-seq unannotated TUs showed enrichment of RNA Polymerase II ChIP, TT-seq, and GRO-seq evidence. Also, 88.4% (518 out of 586 TUs) could be validated by either RNA Polymerase II ChIP or Nascent RNA-seq. Additional details about RNA Polymerase II ChIP and Nascent RNA-seq enrichment in the consensus TU set can be seen in Additional file 2: Table S5.
Histone modifications and RNA Polymerase II occupancy
The correctness of predicted TUs was estimated in K562 cell line, due to the availability of matched RNA Polymerase II and RNA-seq profiles. We predicted 24,571 TUs in K562 majority of which showed typical gene characteristics, with high enrichment of H3K27ac, H3K4me3 and H3K36me3 in TSS and gene bodies (Fig. 3a).
It is known that eukaryotic transcription is regulated by phosphorylation of RNA Polymerase II carboxy-terminal domain at serine 2, 5 and 7. The phosphorylation signal for serine 5 and 7 is strong at promoter region, whereas signal for serine 2 and 5 is strong at actively transcribed regions [43]. Genome-wide RNA Polymerase II profile for K562 cell line was obtained using four antibodies (see “Library preparation of RNA polymerase II ChIP-seq” section) that capture RNA Polymerase II signal at transcription initiation and gene bodies. The enrichment of RNA Polymerase II in predicted TUs was computed using normR [44] (see “Binarization of ChIP-seq profiles” section). The predicted TUs were classified as having high or low RPKM based on mRNA levels (threshold = upper quartile). Figure 3b shows the distribution of RNA Polymerase II enrichment in both the classes of predicted TUs. We observed that a significant proportion of predicted TUs (78%) showed an enrichment of RNA Polymerase II and thus were likely to be true positives. We also came across 24 unannotated TUs that showed an enrichment of RNA Polymerase II (enrichment score above 0.5), but had reduced or no RNA-seq evidence.
Comparison with RNA-seq-based approaches
Currently, there is no gold standard set of true TUs. However, there is a plethora of experimental approaches for studying RNA Polymerase II transcription. In order to perform an unbiased comparison, we integrated RNA Polymerase II data from ChIP-seq and Nascent RNA-seq techniques. For individual cell lines, we defined a set of gold standard regions based on RNA Polymerase II ChIP-seq and Nascent RNA-seq evidence (see Fig. 4a). We compared the performance of EPIGENE with two existing RNA-seq based transcript prediction approaches, Cufflinks and StringTie, both of which are known to predict novel TUs in addition to annotated TUs. The method comparison was performed in two stages: within-cell type and cross-cell type comparison using RNA Polymerase II ChIP-seq and Nascent RNA-seq enrichment as performance indicator (see “Performance evaluation” section, Fig. 4b).
Within-cell type comparison
For this comparison, we used the ChIP-seq profile of RNA Polymerase II in K562 cell line and the pre-existing nascent RNA TUs reported by Schwalb et al. [17] as performance indicator (see “Binarization of Nascent RNA-seq profiles” and “Performance evaluation” sections). The nascent RNA TUs have been reported to show an enrichment of TT-seq and GRO-seq [17]. The ChIP-seq profiles of RNA Polymerase II were obtained using PolIIS5P4H8 antibody because it can enrich RNA Polymerase II both at the TSS and in actively transcribed regions.
We performed the method comparison at 200-bp resolution and found that EPIGENE reports in both the precision–recall curve (PRC) and the receiver-operating characteristic (ROC) curves a higher AUC (PRC: 0.83, ROC: 0.85; Fig. 4c, d) compared to Cufflinks (PRC: 0.60, ROC: 0.63) and StringTie (PRC: 0.77, ROC: 0.82). We repeated this analysis for three different resolutions (50, 100, and 500 bp) and the corresponding AUC values are in Fig. 4e. Cufflinks achieved a lower AUC compared to StringTie and EPIGENE, which is likely due to the usage of the RABT assembler which results in large number of false positives [45].
StringTie reported a lower AUC than EPIGENE for varying RNA Polymerase II resolutions. We examined the precision, sensitivity, and specificity values for EPIGENE, Cufflinks, and StringTie and found that the lower AUC for RNA-seq-based methods was due to spurious read mappings of RNA-seq that results in higher false positives in StringTie and Cufflinks. Additional file 1: Figure S1 shows an example of Cufflinks and StringTie TU that was identified due to spurious read mapping. This TU exactly overlaps with a repetitive sequence that occurs in four chromosomes (chromosome 1, 5, 6, X).
Cross-cell type comparison
For this comparison, we used three different datasets provided by the GEO database [46], ENCODE [31], and DEEP [33] consortium:
-
1.
IMR90: lung fibroblast cells with 6 histone modifications obtained from Lister et al. [47], one RNA Polymerase II obtained from Dunham et al. [48], two control experiments (one each for RNA Polymerase II [48] and histone modifications [47]), one RNA-seq obtained from Dunham et al. [48] and one GRO-seq profile obtained from Jin et al. [49],
-
2.
HepG2 replicate 1 and HepG2 replicate 2: hepatocellular carcinoma with 6 histone modifications, one control experiment and one RNA-seq obtained from Salhab et al. [50] where two replicates per histone modification and RNA-seq were available, RNA Polymerase II ChIP and control experiments obtained from Dunham et al. [48] and one GRO-seq obtained from Bouvy-Liivrand et al. [51].
We applied the K562-trained EPIGENE model to IMR90 and HepG2 datasets and compared the predictions with Cufflinks and StringTie. The ChIP-seq profiles of RNA Polymerase II and GRO-seq profiles were used as performance indicator for both cell lines (see “Binarization of Nascent RNA-seq profiles” and “Performance evaluation” sections). As shown in Fig. 5 and Additional file 1: Figure S2, the K562-trained EPIGENE model consistently reports a higher AUC (PRC: 0.78, ROC: 0.77 in IMR90; PRC: 0.75, ROC: 0.77 in HepG2 replicate 1; PRC: 0.80, ROC: 0.80 in HepG2 replicate 2) compared to Cufflinks (PRC: 0.54, ROC: 0.54 in IMR90; PRC: 0.61, ROC: 0.64 in HepG2 replicate 1; PRC: 0.61, ROC: 0.64 in HepG2 replicate 2) and StringTie (PRC: 0.68, ROC: 0.72 in IMR90; PRC: 0.73, ROC: 0.77 in HepG2 replicate 1; PRC: 0.73, ROC: 0.78 in HepG2 replicate 2). These results suggest that EPIGENE generates accurate predictions across different cell lines, outperforming RNA-seq-based methods.
Comparison with chromatin segmentation approaches
Currently several chromatin segmentation approaches (like ChromHMM and Segway) exist that provide chromatin state annotation using histone modifications. These approaches were inherently designed to provide a whole-genome chromatin state annotation and hence, the model parameters do not represent a specific topology. We examined the results of these approaches to evaluate their accuracy in identifying TUs.
We compared EPIGENE predictions with a widely used chromatin segmentation approach, ChromHMM, as both methods use a binning scheme. We did not include Segway in this comparison because it operates at single base pair resolution and, therefore restricts fair comparison of different profiles. Additionally, Segway is quite slower than chromHMM.
TU identification with chromHMM was performed in two modes: strand-specific and unstranded. Strand-specific TUs were obtained by linking the promoter and transcription elongation states. We defined TU as a genomic region that begins with promoter state and proceeds through transcription elongation states. A promoter state was defined by an enrichment of H3K4me3 and H3K27ac (state 9 in Fig. 6a) and an elongation state was defined by an enrichment of H3K36me3 (state 4, 5 and 8 in Fig. 6a). Unstranded TUs were obtained by filtering chromHMM segmentations for transcription elongation states (state 4, 5 and 8 in Fig. 6a). The comparison was performed using the gold standard regions defined in “Comparison with RNA-seq based approaches” section. As shown in Fig. 6b–e and Additional file 1: Figure S3, EPIGENE consistently performed better (K562; ROC: 0.85, PRC: 0.83) than chromHMM strand-specific (K562, ROC: 0.73, PRC: 0.77) and unstranded TUs (K562, ROC: 0.79, PRC: 0.80). The lower AUC of strand-specific and unstranded chromHMM TUs was due to the presence of intronic enhancers and intermediate low coverage regions that resulted in fewer strand-specific chromHMM TUs and shorter strand-specific and unstranded chromHMM TUs (see Additional file 1: Figure S4).
EPIGENE identifies transcription units with negligible RNA-seq evidence
Previous analyses (see “Histone modifications and RNA Polymerase II occupancy” and “Comparison with RNA-seq based approaches” sections) indicated the presence of TUs supported by RNA Polymerase II evidence but with reduced or no RNA-seq evidence. Here, we evaluated these TUs within and across cell lines by: (a) identifying cell type-specific TUs that showed TU characteristics but lack RNA-seq evidence, and (b) analysing the presence of microRNAs that were not identified by RNA-seq.
EPIGENE identifies cell type-specific transcription units
We created a consensus set of TUs by overlaying the EPIGENE predictions for K562, HepG2 and IMR90. This consensus TU set comprised 18,248 TUs, of which ~ 78% showed an enrichment for RNA Polymerase II. We identified 10,233 differential TUs of which 8047 were exclusive to cell lines (K562: 4247, IMR90: 2545, HepG2: 1255; see Additional file 1: Figure S5). We additionally identified 43 high-confidence cell-specific TUs (K562: 24, IMR90: 17, HepG2: 2; additional details in Additional file 3: Table S6), that lacked RNA-seq evidence but had typical characteristics of a TU, with RNA Polymerase II and GRO-seq enrichment at TSS and transcribed regions, H3K4me3 and H3K27ac enrichment at the TSS, and H3K36me3 enrichment in gene body (Fig. 7).
Identifying microRNAs that lack RNA-seq evidence
MicroRNAs are small (~ 22 bp), evolutionally conserved, non-coding RNAs [52, 53] derived from large primary microRNAs (pri-miRNA), that are processed to ~ 70 bp precursors (pre-miRNA) and consequently to their mature form by endonucleases [54, 55]. They regulate various fundamental biological processes such as development, differentiation, or apoptosis by means of post-transcriptional regulation of target genes via gene silencing [56, 57] and are involved in human diseases [58]. Due to the unstable nature of primary microRNA, traditional identification approaches relying on RNA-seq are challenging. Here, we investigated the presence of primary microRNAs that lack RNA-seq evidence across cell lines. We created a consensus TU set for individual cell lines (K562, HepG2 and IMR90) by combining EPIGENE and RNA-seq-based predictions. The RNA-seq-based predictions were obtained from Cufflinks and StringTie. The consensus TU set was overlapped with miRbase annotations [59] to obtain potential primary microRNA TUs. We identified 655 EPIGENE TUs in HepG2 (5% of total EPIGENE TUs common in both HepG2 replicates) that could be explained by miRbase annotations. We observed that majority of these were supported by RNA-seq and Polymerase II evidence (Fig. 8a and Additional file 1: Figure S6). We additionally identified 2 primary microRNA TUs in HepG2 cell line, which showed an enrichment for H3K27ac and H3K4me3 at their promoters, H3K36me3 in their gene body, and RNA Polymerase II in TSS and transcribed regions while lacking RNA-seq evidence. One of these TUs overlapped with a microRNA cluster located between RP-11738B7.1 (lincRNA) and NRF1 gene (Fig. 8b). This microRNA cluster has been shown to arise from the same primary miRNA and is also known to promote cell proliferation in HepG2 cell line [60, 61].
Discussion
In this work, we introduced EPIGENE, a semi-supervised HMM that identifies active TUs using histone modifications. EPIGENE has TU (forward and reverse) and background sub-models. The TU sub-models were trained in a supervised manner on predefined training sets, whereas the background was trained in an unsupervised manner. This semi-supervised approach captures the biological topology of active TUs as well as the probability of occurrence of histone modifications in different parts of a TU.
We first showed that majority of the predicted TUs can be explained by existing gene annotations and were supported by RNA Polymerase II evidence. A quantitative comparison with RNA-seq revealed the presence of TUs with RNA Polymerase II enrichment but negligible RNA-seq evidence. Considering RNA Polymerase II ChIP-seq and Nascent RNA-seq as true transcription indicator, we compared the performance of EPIGENE with chromatin segmentation approach chromHMM and two RNA-seq-based approaches Cufflinks and StringTie. Based solely on the AUC of PRC and ROC curve as performance measure, EPIGENE achieves a superior performance than chromatin segmentation and RNA-seq-based approaches. We further showed that EPIGENE can be reliably applied across different cell lines without the need for re-training the TU states and accomplishes a superior performance than RNA-seq-based approaches.
We examined other performance scores like precision, sensitivity, and specificity values, and observed that the low AUC of RNA-seq-based approaches is due to RNA-seq mapping artefacts that resulted in higher number of false positives in Cufflinks and StringTie. We further evaluated the presence of differentially identified TUs in K562, HepG2, and IMR90 cell lines that lack RNA-seq evidence. The results suggested the presence of cell line-specific transcripts that lack RNA-seq evidence. We additionally identified potential microRNA precursors that lacked RNA-seq evidence presumably due to their instability. All of the aforementioned TUs showed an enrichment of RNA Polymerase II in TSS and gene body indicating that they had been transcribed.
It is important to note that EPIGENE does not differentiate between functional and non-functional units of a TU (exons and introns) as the association between histone modifications and alternative splicing is yet to be elucidated [62]. However, EPIGENE identifies active TUs with high precision as shown in “Comparison with RNA-seq based approaches” section and in the example regions presented in this work.
EPIGENE uses six core histone modifications that are available for many cell lines and species, which leads to a broad applicability. All the core histone modifications are essential for accurate TU identification, as the accuracy of TU prediction decreases in the absence of any of the core histone modification. In the absence of a core histone modification, imputation techniques such as ChromImpute [63] and PREDICTD [64] can be used to impute the missing histone modifications at 200-bp resolution and then use the imputed histone modification together with the available histone modifications to obtain active TUs. The accuracy of EPIGENE predictions also depends on the sequencing depth of the input histone modifications, therefore, high-quality ChIP-seq profiles of histone modifications would result in high confident TU annotation.
In summary, the superior performance within and across cell lines, identification of TUs, especially primary microRNAs lacking RNA-seq evidence as well as interpretability makes EPIGENE a powerful tool for epigenome-based gene annotation.