Consistent nucleosomes in the genome of S. cerevisiae
A number of studies have attempted to define the complete set of nucleosome positions along the yeast genome [7, 8, 13, 23, 24]. The final output of these studies varies. Lee et al. [7] applied a hidden Markov model on their raw data, which resulted in a set of genomic coordinates for well-positioned nucleosomes. Shivaswamy et al. [8] directly provided sequence coordinates for the inferred nucleosome positions. Mavrich et al. [23] provided the densest of all datasets, with more than 65,000 nucleosome positions (in the form of genomic coordinates of the pseudodyad axes), but these belonged to a mixed population originating from different experimental approaches (sequencing and tiling arrays), which in addition included hypothetical nucleosomes (that is, positioned by the authors based on general occupancy assumptions). Kaplan et al. [13] assigned a score to each nucleotide in the yeast genome, reflecting the in vitro affinity of the underlying sequence for histone binding. By contrast, Zhang et al. [24] provided only the raw datasets, making it difficult to include their inferred positions in a comparative study.
We thus chose to directly compare the datasets of Lee et al. [7] and Shivaswamy et al. [8], because they both consist of homogeneous datasets that span the complete genome of S. cerevisiae, and to use the data of Mavrich et al. [23] and Kaplan et al. [13] for further validation of our results. The sparseness of unaddressed sequence space in the datasets of Lee et al. [7] and Shivaswamy et al. [8] makes the comparison between them less complicated (see below). In addition, both datasets consist of positions with reference to genomic coordinates, providing a consistent platform for direct comparison. The analysis was carried out as follows.
We connected each of the experimental calls of Lee et al. [7] (40,089 nucleosomes) with its closest corresponding well-positioned nucleosome in the dataset of Shivaswamy et al. [8] (49,043 nucleosomes). For every such pair, we calculated the percentage of overlap, expressed as the percentage of the intersection over the union of their corresponding lengths. We then analysed the distribution of these overlap values (see Additional File 1). We found that 31,234 nucleosomes (78%) of the Lee et al. [7] dataset had at least one nucleotide overlap with one of 33,146 nucleosomes (68%) of the Shivaswamy et al. dataset [8]. In the analysis, we excluded non-overlapping segments to avoid biases originating from local discrepancies between the two datasets (for example, sequencing preferences, experimental biases, or gaps). We thus confined our study to the overlapping nucleosomes between the two datasets. Of these overlapping positions, only 9.8% of the cases (3,061 nucleosomes) had an overlap exceeding 0.95, and 17.8% of the cases (5,560 nucleosomes) had an overlap greater than 0.90. These overlaps are still much larger than expected by chance. To test the significance of these values, we performed 1,000 simulations of two random experiments, with an equal number of nucleosomal calls (see Additional File 1). The overlap values between two random experiments followed a Poisson distribution, which was significantly different from the observed distribution (P < 10-3, Kolmogorov-Smirnov test). In fact, an overlap ratio of 0.95 was not encountered in any of the 1,000 simulations. We should note here that the positions by Shivaswamy et al. [8] were provided at single-nucleotide resolution, whereas those of Lee et al. [7] were given with a resolution equaling that of the chip used, which was four nucleotides. The latter corresponds to ~2.5% of the nucleosome length, which, taken on either side, roughly translates to a 5% uncertainty value for each nucleosome position. We therefore believe that a value of 95% represents the maximum possible overlap, given the combined resolution of the datasets and thus opted for the overlap ratio of 0.95 to define well-positioned nucleosomes that were consistent between the two sets. As these are expected to have a stable position across all cells in the population, we shall refer to them as 'consistent nucleosomes' hereafter.
Consistent nucleosomes are well positioned in vivo and in vitro
A direct comparison of the two previously discussed datasets and their overlapping subsets was then crossvalidated using the scores of Shivaswamy et al. [8], Mavrich et al. [23] and Kaplan et al. [13]. Initial validation of the consistent nucleosomes was conducted through direct comparison with in vivo and in vitro hybridization scores. We used the initial in vivo scores provided by Shivaswamy et al. [8] for each of their nucleosome calls. We then compared the scores of consistent nucleosomes with non-overlapping nucleosomes; that is, those that had zero overlap between the two sets. Scores for nucleosomes with no overlap were significantly lower than for those with high overlap values (see Additional File 1). Because these scores were directly linked to the initial raw experimental signal, the observed differences may be an indication that consistent nucleosomes are positioned more strongly in vivo.
Mavrich et al. [23] also provide a score for each inferred position. This is an occupancy measure based on the mode-normalized occupancy across all of their four sequencing datasets, ranging from 0 to 100; a mean score of 100 represents nucleosomes found at the same position in all four replicate datasets. To further validate our dataset of consistent nucleosomes, we analyzed the cases that had > 0.95 overlap with the dataset of Mavrich et al (~66,000 nucleosomes), and plotted the distributions of mean occupancy for the complete datasets as their overlapping subsets (see Additional File 1). We found that mean occupancy increased sharply as we moved from the initial complete dataset of Mavrich et al. [23] to two-set overlaps, reaching a mean of 93 in the case of the three-set overlap (segments from Lee et al. [7], Shivaswamy et al. [8] and Mavrich et al. [23] that were overlapping in > 0.95 of their combined lengths). In fact, none of the 3061 consistent nucleosomes, defined as the overlaps between Lee et al. [7] and Shivaswamy et al. [8], showed an overlap of < 0.96 with a corresponding segment of the denser Mavrich et al. [23] dataset, an additional indication that they comprise a subset of highly reproducible positions.
In a recent work, Kaplan et al. [13] attempted to decouple the intrinsic sequence preferences of nucleosomes from the combined action of all influencing factors, and thus provided a measure for the in vitro affinity of the underlying DNA for nucleosome formation. Although this is expected to correlate well with in vivo positioning, we examined possible differences between consistent and non-overlapping nucleosomes under in vitro conditions using the model scores as calculated by Kaplan et al. [13] (see Methods for details). We found a significant enrichment of model scores for consistent nucleosomes compared with zero overlap nucleosomes, as expected (see Additional File 1).
Taken together, these results verify our hypothesis that well-defined nucleosomes are positioned more strongly both in vivo and in vitro.
Sequence conservation in consistent nucleosomes
The focus of this work was the investigation of sequence constraints characterizing consistent nucleosomes. We used phastCons sequence conservation scores [26] (see Methods), as a measure of evolutionary conservation between seven species of the genus Saccharomyces, to calculate the average sequence conservation for the genomic regions under study. To avoid biases originating from the background genomic composition, we partitioned all datasets into genic and intergenic nucleosomes. Those falling within genes were treated separately from those occupying non-coding genomic space. Conservation of bulk nucleosomes in the two complete genome datasets [7, 8] is comparable with background conservation in the yeast genome (0.71 for genic nucleosomes, compared with 0.73 for yeast genic regions). Interestingly, consistent nucleosomes show overall lower conservation than the genomic background for both genic and intergenic partitions of the genome (Figure 1). This finding is somehow counterintuitive. Consistent nucleosomes were shown to have enriched affinity for histone molecules as suggested by both in vivo and in vitro experiments and to share particular positional preferences (see Additional File 1). These facts combined would imply an increased functional role for the underlying sequences, which does not easily comport with the observed lack of sequence conservation.
A number of studies [6, 23, 27] have pointed out the role of NFRs in gene regulation. Such regions are usually enriched in regulatory cis-acting binding elements for transcription factors that exclude the positioning of nucleosomes and could thus indirectly affect the organization of chromatin. These NFRs are therefore expected to be under specific sequence constraints, and related to the positioning of nucleosomes. To test whether such regions may be related to our consistent nucleosomes (as defined above), we extracted the genomic regions that fell between two adjacent consistent nucleosome segments We defined as 'adjacent' two consistent nucleosomes separated by a distance less than the size of a nucleosome and termed their intervening Nucleosome-Free Regions as 'consistent NFRs'. The set of these regions consisted of 1,099 genomic segments with an average length of ~40 nucleotides. The small percentage (~10%) that had a length of > 100 nucleotides suggested that they probably represent natural nucleosome linker sequences; that is, short spacers between two juxtaposed well-positioned nucleosomes. There has been evidence that NFRs of greater length may accommodate specific nucleosomes, whose unstable nature obstructs their direct observation [28]. We thus restricted our definition of consistent NFRs to consider only regions shorted than a full nucleosome. The relative enrichment (1,099 pairs out of 3,061) of such short linkers occurring between consistent nucleosomes reflects an additional property of the latter, namely, their preference to appear in pairs of dinucleosomes (see also Additional File 1).
As expected, consistent NFR segments showed increased sequence conservation. PhastCons scores calculated for these regions showed an average sequence conservation above the yeast genomic background for both genic and intergenic regions. Although strong sequence constraints are known to exist in these regions, this extremely high conservation of consistent NFR suggests an increased importance at various levels. Overall, the 1.099 defined segments represented less than 0.5% of the length of the total genome, and even if this is an underestimate due to restricted coverage of the analyzed datasets, their extent is unlikely to exceed 1% of the total genome size. On the one hand, the extremely high conservation of these sequences indicates multiple roles, which may include the guiding of nucleosome positioning (see below). On the other hand, their overall rarity suggests that signals of a different type (possibly lying within nucleosomes) are probably also necessary even though they may not be reflected in the primary sequence conservation.
Possible role of consistent nucleosomes in gene regulation
The proximity of consistent nucleosomes to highly constrained NFRs that carry regulatory elements suggests that stable nucleosome positioning may be functionally related to the nearby binding of transcription factors. In fact, NFRs that host transcription factor binding elements have been proposed to act as nucleosome positioning markers [6, 23]. If this holds, positional preferences defining a mutual exclusion between transcription factor binding sites and nucleosomes would be expected. To test this, all consistent nucleosomes were centrally aligned and extended in both directions for 100 nucleotides. The resulting sequences (347 nucleotide long) were then analyzed with PEAKS [29], an application to identify significant motif positional biases in pre-aligned DNA sequences. We performed a PEAKS search of yeast-specific regulatory motifs as defined by Harbison et al. [30]. The results of this analysis revealed a significant over-representation of positional biases for regulatory motifs, which are predominantly confined to the flanking regions of consistent nucleosomes, in clear contrast with the absence of significant motifs in the centre of the sequence corresponding to the nucleosome (Figure 2). The biased motifs belonged to a number of transcription factors and had variable sequence compositions, with both AT-rich (TBP) and GC-rich (Ume-6, Reb1, Swi4) sites found among them, suggesting that these biases are partly independent from sequence composition and probably reflect more general structural or/and functional tendencies.
Identical analysis was performed on a set of nucleosomes that was defined as the low overlap (< 5% but at least one overlapping nucleotide) fraction of the two datasets [7, 8]. This set comprised 3,288 segments obtained from the Lee et al. [7] study, which we termed 'weakly positioned nucleosomes'. Inspection of the PEAKS output suggested much weaker positional biases for consistent nucleosomes. Only a few significant motif biases that lacked specific positional preference with respect to the nucleosome position were present.
Structural constraints of consistent nucleosomes
A number of studies have suggested the existence of a 'nucleosome code' [11–13, 16, 31], but existing models appear to explain the binding of only a part of the complete set of nucleosomes. However, genome-scale studies on nucleosome formation in more complex eukaryotes [32, 33] have reported limited sequence constraints in nucleosome sequences. It remains to be resolved whether this apparent lack of constraint is only an artifact of limited knowledge or due to inherent attributes of the dynamics of the process.
The results presented thus far suggest an overall absence of primary sequence constraints in the form of sequence conservation even in well-positioned nucleosomes. Nonetheless, even under an assumed statistical model, some degree of constraint would be expected in those positions that are thought as instrumental for the overall nucleosome-positioning process. In the case of nucleosomes, which are particles involved in chromatin structure and conformation, it is plausible to expect that constraints may be better reflected within the DNA structure rather than in the primary sequence itself. We therefore introduce a concept of structural information, which is directly related to the expected curvature of a DNA sequence.
The importance of curvature in the formation of nucleosomes has been demonstrated by several studies [34–37]. The DNA sequence spooling the histone octamer is curved, and it is plausible to assume that specific curvature patterns may be advantageous for a nucleosome-forming sequence. Another important aspect of nucleosomes is their inherent symmetry. The 147 bp nucleosomal sequence wraps around the octamer in a symmetric manner, and its central part is associated with the octamer's pseudodyad axis, a region that has been shown to bear structural features distinguishing it from its flanking regions [34, 38–40]. Starting from the above observations, we hypothesized that nucleosomal DNA sequences are likely to consist of: (i) two symmetrically curved regions flanking the central part, as DNA is expected to be curved in a similar manner on either side of the pseudodyad axis; and (ii) a region of higher flexibility associated with the pseudodyad axis, at which the two superhelical loops have to meet and spool the histone octamer in a more distorted conformation. If these properties are to be reflected in the underlying nucleosomal DNA structure, then a measure that efficiently quantifies them should reveal structurally related differences between nucleosome-binding and nucleosome-free sequences.
In a recent work [41], we introduced a structural property that quantifies the strength of the pattern described above. We implemented this concept ('symmetry of DNA curvature') in a suitably adjusted method (SymCurv), which permits us to attribute a symmetry of curvature score to each nucleotide on a given DNA sequence (see Methods for details). We calculated SymCurv for the complete genome sequence of S. cerevisiae, and went on to check whether consistent nucleosomes showed the expected tendency for high SymCurv values. The results are shown in Figure 3a. Consistent nucleosomes were found to have significantly higher SymCurv values compared with their adjacent linker sequences, whereas weakly positioned (unstable) nucleosomes (that is, those with < 5% overlap between the two sets) were shown to have significantly lower values than the consistent nucleosomes. A more thorough examination was conducted after centrally aligning all consistent and weakly positioned nucleosomes, and calculating the average SymCurv value over each nucleotide position with respect to the centre of the assumed nucleosome (Figure 3b). Consistent nucleosomes had the expected low-high-low SymCurv pattern, which reflects a strongly positioned nucleosome in the centre, flanked by two NFRs. Weakly positioned nucleosomes significantly deviated from this pattern, showing a wider peak with a 'fuzzier' shape with lower SymCurv values and with less pronounced boundaries. The SymCurv pattern for consistent nucleosomes is therefore suggestive of specific structural preferences related to nucleosome positioning, which are not reflected in the sequence conservation.
To better establish the existence of this structural constraint, we went on to test whether SymCurv values of consistent nucleosomes are susceptible to minor changes in the primary sequence. If structural constraints exist, it would be expected that small changes, even point mutations, would bring about significant changes in the structural profile of the sequence, thus the sequence will have low structural robustness. We formulated a simple measure of robustness for SymCurv, based on the variance of SymCurv values over all one-nucleotide neighbors of a given sequence (see Methods for details). A sequence under strong structural constraints is expected to exhibit high variance, and therefore its robustness will be low. We calculated SymCurv robustness values for the three sequence classes under study and plotted the corresponding distributions (Figure 3c). Consistent nucleosomes had robustness values similar to those NFRs, and were significantly lower than those of weakly positioned nucleosomes. The low robustness values of both consistent nucleosomes and NFRs are indicative of strong structural constraints, even though their profiles in terms of SymCurv values are significantly different. In the case of NFRs, these constraints coexist with those of the primary sequence as indicated by high sequence conservation, which may offer an explanation for their extremely high conservation, reflected in very high average phastCons values (Figure 2). By contrast, consistent nucleosomes also appear to be under structural constraints, but at the same time lack significant conservation at the level of primary sequence. This may be an additional indication for the existence of a weak structural code for nucleosome positioning, which is only indirectly connected to the DNA sequence.