Single-nuclei chromatin profiles of ventral midbrain and identification of major cell types in two mouse strains
To unravel the cell-type-specific gene regulation in midbrain, and how it impacts genetic regulatory variation, we performed ATAC sequencing at single-nuclei level (snATAC-seq) on dissected midbrains from two genetically distinct mouse strains, C57BL/6J and A/J (Fig. 1, Additional file 1: Figure S1). Perfused ventral midbrain sections from two hemibrains of one mouse for both strains were used for the partitioning and barcoding with a total of 13,640 and 13,259 nuclei from C57BL/6J and A/J, respectively, and following high throughput sequencing. After filtering of multiplets and nuclei with low coverage, approximately 290 million reads per mouse strain were retained, corresponding to 10,298 (C57BL/6J) and 10,360 (A/J) individual accessibility profiles (Fig. 1A). The bulk chromatin accessibility profile aggregated across single nuclei (bulk snATAC-seq) from C57BL/6J showed a total of 231,390 peaks. Notably, 99.7% of regular bulk ATAC-seq peaks obtained from an independent C57BL/6J midbrain section overlapped with bulk snATAC-seq peaks (Additional file 1: Figure S1). Moreover, the bulk snATAC-seq profile from A/J with 235,157 peaks was also highly correlated with C57BL/6J profile (Pearson R > 0.97). Finally, to distinguish accessible regions at enhancers and promoters actively engaged in transcriptional control, we performed ventral midbrain bulk level ChIP-seq analysis in both mouse strains for histone H3 lysine 27 acetylation (H3K27ac) [21, 22]. Both bulk ATAC-seq and H3K27ac ChIP-seq showed clear correlation with midbrain gene expression levels (Additional file 1: Figure S2).
Using an existing single cell genomics toolkit [23, 24], the dimensionality of snATAC-seq was calculated by performing latent semantic indexing (LSI), to allow clustering of the cells with uniform manifold approximation and projection (UMAP) (Fig. 1A). A gene activity matrix of snATAC-seq was established by counting reads in the gene body and the promoter region [2 kb upstream of transcription start site (TSS)]. To annotate the obtained clusters as individual cell types, we took advantage of existing single cell RNA sequencing (scRNA-seq) of mouse midbrain [8]. Through identification of anchor genes shared between the gene activity matrix of snATAC-seq and the highly variable features in scRNA-seq, we could identify 10 different midbrain cell types with distinct chromatin accessibility profiles and sufficient numbers of cells in both strains (Fig. 1A, Additional file 2: Table S1). The cell types are grouped into six main clusters consisting of glial cells, such as astrocytes (13.5%), microglia (4.2–5.6%), oligodendrocytes (14.–19.5%), and two subtypes of polydendrocytes (Tnr+ and Tnr+/Cspg5+) (3.2–3.9%), two different types of endothelial cells (stalk and tip; 1.1–3.4%), and the largest and most diffuse cluster neurons making up more than half of all cells (57.1–60.8%). Although scRNA-seq data could distinguish up to 30 neuronal subtypes in midbrain through combinations of marker gene abundances [8], at chromatin accessibility level these could not be clearly distinguished. Instead, only three classes of neuronal cell types could be well distinguished: thalamus glutamatergic neurons (referred to as Slc17a6+ neurons), dopaminergic neurons (referred to as Th+ neurons), and a broader group of neurons consisting to large extent, but not exclusively, from different Gad2+ GABAergic neurons (referred to simply as neurons).
Interestingly, an increased proportion of Th+ and Slc17a6+ neurons and decreased proportions of oligodendrocytes and macrophages could be detected in A/J samples compared to C57BL/6J, while the proportion of astrocytes and Tnr+/Cspg5+ polydendrocytes remained almost identical (Fig. 1A).
Inspection of genomic loci encoding for known cell-type-specific marker genes in C57BL/6J samples disclosed highly cell-type-selective chromatin accessibility that was well correlated with gene expression levels in scRNA-seq data of mouse midbrain (Fig. 1B). While ubiquitously expressed Rpl13a gene had high and consistent levels of accessibility across the cell types, known marker genes for astrocytes (Aldh1l1) and microglia (Tmem119 and Selplg) [25, 26] were expressed and most accessible in the respective cell types, especially at their TSS. Similarly, the gene encoding for dopamine transporter (Slc6a3) had highest levels of expression and accessibility in the Th+ neurons, while in other neurons almost no signal could be detected. At the same time, the adjacent Clptm1l gene harboured an accessible promoter in all of the cell types. Finally, the locus encoding for two TFs required for oligodendrocyte generation and maturation, Olig1 and Olig2 [27, 28], showed highest accessibility in subtypes of polydendrocytes and oligodendrocytes, as well as astrocytes, again consistent with the gene expression levels.
Importantly, the accessibility profiles between C57BL/6J and A/J were highly comparable also at the level of individual cell types and could equally highlight cell-type-specific accessibility consistent with gene expression levels, as shown in Additional file 1: Figure S1 for Aif1, a known marker gene for microglia.
Taken together, our snATAC-seq profiling produced over 20,000 chromatin profiles of mouse midbrain cell types with comparable quality from two different mouse strains. These data allow the distinction of 10 different midbrain cell types at epigenomic level that are consistent with known gene expression profiles.
Identification of cell identity genes and associated regulatory regions from single cell data
To leverage the available data for the identification of TFs controlling cellular identity in adult midbrain cell types, we first set out to determine the genes whose expression was selective for each cell type. To obtain these cell identity genes, we used the existing scRNA-seq of the mouse midbrain, and for each gene determined the 85th percentile of its expression across all cell types indicating gene specific “high expression” (Fig. 2). To filter out genes being selectively expressed in a specific cell type, at least 60% of the cells of that cell type have expression not less than the 85th percentile. Furthermore, to ensure uniqueness, no other cell type was permitted to have the same gene among its top expressed genes (as defined by their 85th percentile) in more than 40% of the cells. Through this approach we could define between 47 and 412 identity genes for each cell type (Fig. 2; Additional file 3: Table S2). On average, 170 genes per cell type were determined. To confirm the relevance of the genes to the biology of the cell type in question, GO enrichment analysis for biological processes was performed. In keeping with the genes’ role in the molecular and biological identity of the cell types, the top enriched GO terms included: Positive regulation of angiogenesis for endothelial stalk cells; Neutrophil mediated immunity for microglia; Axonogenesis, Neurotransmitter transport, and Regulation of synaptic vesicle exocytosis for the different neurons; Septin ring assembly and Myelination for oligo- and polydendrocytes; and Negative regulation of neuron differentiation for astrocytes. Examples of gene expression profiles of identity genes from selected cell types are shown in Fig. 2B. Full list of enriched GO terms are provided in Additional file 4: Table S3.
Next, to determine gene regulatory regions controlling the expression of cell identity genes, we performed peak calling on cell-type-specific aggregate ATAC-seq signals and associated the peaks to the defined identity genes of the respective cell types using GREAT (basal regulatory region ± 100 kb from TSS or up to nearest gene [29]). This resulted in 100–1200 accessible regions likely to control cell identity gene expression in each cell type (Fig. 2; Additional file 5: Table S4).
Cell-type-specific chromatin accessibility profiles uncover cell identity regulating transcription factors
Comparison of the chromatin accessibility levels across the cell types confirmed a clear increase in accessibility at the obtained cell identity peaks associated with respective cell identity genes (Fig. 3A). The highest increase in signal over background of aggregated midbrain cells was always detected in the corresponding cell types expressing the associated identity genes. At the same time, depletion of signal could be detected in other cell types. Interestingly, the level of accessibility also reflected the developmental relationships of the cell types. The strongest depletion of signal could be detected in the developmentally most distant cell type, microglia [30]. Consistently, microglia identity peaks showed the strongest depletion in all other cell types. In contrast, neuron identity peaks showed no major depletion of signal in the related Th+ neurons and vice versa. Altogether, our approach could accurately identify cell-type-specific gene regulatory regions controlling cell-type identity.
To identify TFs binding the regulatory regions and controlling cell-type-identity genes, we performed TF binding motif analysis in sequences enriched at cell identity peaks (Fig. 3B). This analysis was done for eight cell types with the highest sequencing coverage. Importantly, the analysis highlighted motifs for several TFs previously shown to control the differentiation or identity of the respective cell types. These included SOX9 in astrocytes [31], SPI1 in microglia (also known as SFPI1 or PU.1 [32, 33]), SOX17 in endothelial stalk [34], and SOX10 and SOX8 in oligodendrocytes [31, 35]. The most enriched motif across cell types was the shared binding site for CTCF and CTCFL, the sequence occurring at insulator regions, where CTCF mediates chromatin looping events [9], indicating the formation of cell-type-selective chromatin looping and topological domains at the loci of cell identity gene loci.
In addition, NFI-family motif was enriched in astrocytes and polydendrocytes, consistent with its reported role in the transition from neurogenesis to gliogenesis [36] and the requirement of NFIC for expression of astrocyte marker genes [37]. Motifs enriched in microglia included MAF motif which can be bound by MAFB, a TF recently shown to be important for maintenance of homeostasis in adult microglia [38]. Interestingly, RFX-family motif was highly enriched in both astrocytes and neurons. Indeed, Rfx1, Rfx3, Rfx4, and Rfx7 are known to be expressed and to play a role in the brain [39], with Rfx4 showing the strongest expression in astrocytes, while Rfx3 and Rfx7 are abundant in different neurons [8]. Thus, our data warrant further investigation of individual TFs in the RFX-family in the cellular identity of midbrain neurons and astrocytes.
Together with RFX family, another TF with enriched motif in neurons, ZNF740, also has been shown to localize at gene enhancers active specifically in differentiated human neuronal cell lines, further supporting the relevance of this prediction across species [40]. Finally, the motifs enriched uniquely in Th+ neurons included binding sites for KLF family TFs, MEF2 TFs, and ZBTB7 TFs (that share their core motif with WT1 and ZNF263) (Fig. 3B). From these particularly Klf9, Mef2a, Mef2d, and Zbtb7c show high expression in Th+ neurons [8], with Mef2d exhibiting the most selective expression, an observation that could guide more detailed experiments into their role in dopaminergic neuron identity.
Genetically driven chromatin accessibility changes reveal cell-type-specific gene expression changes
We have recently shown that the midbrain phenotypic differences between C57BL/6J and A/J (and associated behavioural changes) are accompanied by extensive gene regulatory variation [20]. Based on tissue-level bulk RNA-seq analysis of 12 independent mice per strain, 1151 genes are significantly differentially expressed (> twofold, FDR < 0.05) in the ventral midbrains between the two strains (Fig. 4A). However, the cis- and trans-acting mechanisms underlying these genetically driven changes, and the affected cell types, are not known.
To address the contribution of chromatin level changes to the gene expression variation, we compared the bulk snATAC-seq signals from the mouse strains and focused on top differential peaks (FDR < 0.05, top 1% of logFC) (Fig. 4A). The extent of fold changes at the chromatin accessibility level was more modest than at the transcriptomic level. Nevertheless, 1317 of 263,709 called peaks were significantly altered at the bulk level. By associating accessible regions to their target genes, we could observe a significant enrichment of such regions at the differentially expressed genes. This indicates at least some of the gene expression changes could be linked to changes at the chromatin level. Observing the data at the level of individual cell type allowed detection of additional cell-type-specific differentially accessible regions, indicating that cell-type-specific changes from rare cells could be masked by tissue level analysis. The significant enrichment of the differential peaks at differentially expressed genes held in all cell types except dopaminergic neurons which might result from their low sequencing coverage (Fig. 4A).
For genes, such as Isoc2b, the decreased gene expression in ventral midbrain of A/J was accompanied by reduced accessibility of the promoter across all cell types (Fig. 4B). To confirm that the lost accessibility was also accompanied by reduced transcriptional activity, we observed H3K27ac levels at the promoter. Consistent with reduced ATAC-seq signal, H3K27ac was also lost at Isoc2b locus in A/J.
For ubiquitously expressed genes such as Isoc2b the altered gene expression could be associated with chromatin level changes even at bulk level analysis. However, for other genes, such as Olfr287, a reduced expression could be observed by RNA-seq although no signal was detectable at bulk chromatin level by any of the methods (bulk ATAC-seq, bulk snATAC-seq, and ChIP-seq). Still, when observing the cell-type-specific snATAC-seq data, an accessible region could be detected at Olfr287 promoter specifically in astrocytes. In addition, consistently with reduced gene expression, the chromatin was less accessible in A/J (Fig. 4B).
In summary, gene regulatory variation in midbrain is associated with chromatin level changes in accessibility, although not at all loci and with lower sensitivity than in transcriptomic analysis. Interestingly, snATAC-seq can reveal cell-type-specific regulatory changes not captured in bulk level analysis.
Putative cis-acting variants are enriched at midbrain regulatory regions of differentially expressed genes
To obtain further insight into the mechanisms underlying the strain-specific gene expression, we next set out to address the extent of cis-acting regulatory variation contributing to the observed differences in the midbrain. For this we focused on identification of putative midbrain regulatory variants segregating C57BL/6J and A/J. We first performed TF footprint identification on our midbrain chromatin accessibility profile obtained through the bulk ATAC-seq analysis. Then, these footprints were overlapped with > 6 million variants segregating C57BL/6J and A/J to identify those with the potential to disrupt TF binding. Finally, the binding sites were overlapped with the midbrain H3K27ac profiles from both C57BL/6J and A/J to capture the binding sites engaged in transcriptional activity in either mouse strain, in total yielding 3909 putative regulatory variants of the ventral midbrain (Additional file 6: Table S5).
The capacity of the above approach to reduce the number of meaningful variants is illustrated in Fig. 5A with the examples of the Ddhd1, Zfp619, and 4.5S ribosomal RNA (rRNA) loci. Expression of Ddhd1, a gene coding for a phospholipase, is modestly but significantly reduced in A/J compared to C57BL/6J and shows accessible chromatin at its TSS and at an upstream enhancer site > 20 kb from the TSS. Both regions are marked by H3K27ac signals in both strains. One TF footprint could be identified at both the TSS and the distal enhancer, representing the putative TF binding sites controlling Ddhd1. From total of 603 variants at the 61 kb locus, only one coincides with an active TF binding site occupied in the midbrain, corresponding to a putative regulatory variant influencing Ddhd1 expression in this brain region. Consistently, the affected enhancer shows decreased H3K27ac enrichment in the A/J. This illustrates how majority of genetic variants at any given locus are unlikely to affect gene expression and how, by focusing on those co-localizing within active regulatory regions, those most likely to act as regulatory variants can be identified.
If the midbrain gene regulatory differences between C57BL/6J and A/J indeed depend on the cumulative effect of cis-acting variants, such as the variant at the Ddhd1 locus, the identified regulatory variants would be expected to be enriched in regulatory regions and TF binding sites at the differentially expressed gene loci compared to other expressed genes. To test this directly, we associated all putative regulatory variants to their likely target genes as already outlined in Fig. 2, and calculated the number of variants that on average associate with each of the 5082 differentially expressed genes (FDR < 0.05) (Additional file 7: Table S6) [20]. We did the same for all expressed genes found not to change between the strains (FDR > 0.05) and for an equal number of randomly selected expressed genes as controls. While unaffected genes and randomly selected genes were associated on average with 0.14 and 0.18 regulatory variants, respectively, this number significantly increased to 0.40 regulatory variants for the differentially expressed genes (Fig. 5B). Consequently, genetic variants located in midbrain regulatory regions do not show a random distribution but are instead enriched at the differentially expressed genes, suggesting they play an important role in explaining the observed transcriptomic differences.
Next, we considered whether localization of variants in the TF binding sites of active enhancers could also be associated directly with enhancer activity upstream of gene expression changes. With this aim we used THOR [41] to identify enhancer regions with significantly altered signal for the H3K27ac enhancer mark between midbrains of C57BL/6J and A/J. Interestingly, 1126 of the 3909 putative regulatory variants localized within an enhancer region with differential H3K27ac enrichment under a stringent cutoff (p < 1 × 10–18) (Additional file 8: Table S7). This indicates that a large proportion of putative regulatory variants associate with enhancers that gain or lose activity between the mouse strains. For example, enhancer harbouring putative regulatory variants in the proximity of locus coding for 4.5S rRNA, a ribosome-interacting non-coding RNA, exhibits a strong gain of enhancer activity in A/J compared to C57BL/6J (Fig. 5A). In addition, at locus of Zfp619, a gene coding for a zinc finger TF, both gain and loss of enhancer activity can be observed simultaneously at two separate enhancers associated with multiple putative regulatory variants. Taken together, disruption of TF binding by variants across thousands of enhancer regions is likely to alter enhancer activity, and thereby midbrain gene expression in genetically diverse mouse strains.
Cell-type-specificity of cis-acting variants in the midbrain
Having identified the putative cis-acting regulatory variants contributing to the midbrain gene expression phenotype between C57BL/6J and A/J, we next sought to understand how cell-type-selective these variants are. Overlapping the putative regulatory variants with cell-type-specific accessibility data suggested that majority of the variant binding sites (57.9%) were accessible, with potential to affect gene expression, in at least 6 out of the 10 cell types (Fig. 5C). However, just under 14% of the variants were accessible in only 1 or 2 cell types, indicating non-coding variation can have cell-type-selective effects on gene expression (Fig. 5C). Indeed, variants with cell-type-selective accessibility in only 1–3 cell types were significantly more often occurring at genes with altered expression than at other expressed genes (data not shown).
A number of regulatory variants were accessible specifically in neurons and associated with differentially expressed genes, representing neuron-specific gene regulatory variation. These include for example Tekt5, a gene expressed in excitatory neurons and upregulated in C57BL/6J (Fig. 5D). Despite the observed differential expression, Tekt5 promoter appeared inaccessible in most cell types except for neurons of C57BL/6J, exactly at TSS overlapping with a putative regulatory variant within a TF binding site.
Taken together, while majority of cis-acting variants affect broad array of cell types, a large proportion can have cell-type-specific effects that cannot be dissected without single cell analysis.
TCF7L2 as a mediator of trans-acting variation in midbrain neurons
A large fraction of midbrain gene expression variation could be linked to cis-acting regulatory variants, even with our strict criteria on the presence of variant in a TF footprint located in an active enhancer (Fig. 5A). Still, much differential gene expression remained unexplained. This could be due to cis-acting variants we have missed, but also due to trans-acting variants that can influence a number of target genes by altering a TF’s activity, rather than its binding motif. A change in TF activity could be due to change in its expression level, but could also be due to alternations in other mechanisms controlling TF activity, such as post-translational modifications, protein–protein interactions or TF localization.
Genetic differences in non-dopaminergic neurons (such as Gad2+ neurons), that make up much of our neuron population (Additional file 2: Table S1), have been suggested to contribute to strain specific behavioural differences, including anxiety, reward and motivation traits, such as ethanol consumption [42,43,44,45]. To identify mediators of trans-acting variation between C57BL/6J and A/J in neurons, we performed motif enrichment analysis on 715 regions in neurons showing differential chromatin accessibility between the mice (Fig. 4A). This revealed the binding motif of TCF/LEF family, downstream TFs of the canonical Wnt signalling pathway [46], to be among the most enriched sequences found at the differentially accessible regions (Fig. 6A; p = 1.66e−14). The enrichment was specific for Gad2+ neurons and could not be found in any other cell types (Additional file 1: Figure S3) or in the motif enrichment analysis for cell identity genes (Fig. 3). LEF1, TCF7L1, and TCF7L2 bind to the same DNA sequence but have often opposing or cell-type-specific functions [47]. Inspecting chromatin accessibility across the cell types for the differential binding sites carrying the TCF/LEF motif revealed an increased signal specifically in the neurons (Fig. 6B). To determine which factor is expressed in the midbrain neurons and could mediate the observed enrichment and altered accessibility, we visualized their expression using scRNA-seq data. Lef1 expression was limited to the endothelial cells (Fig. 6C) and Tcf7l1 showed only low or no expression across the cell types (Fig. 6D). However, Tcf7l2 had high expression in the Gad2+ and Slc17a6+ neurons and polydendrocytes, showing a clear overlap with cells enriched for the respective binding motif (Fig. 6E). Consistently, an analysis of putative upstream regulators explaining the transcriptomic changes from RNA-seq of the mouse strains using Ingenuity Pathway Analysis (IPA) predicted TCF7L2 (also known as TCF4) and CTNNB1 (beta-catenin, binding partner of TCF7L2) to be among the top regulators based on the predicted activation score (Additional file 9: Table S8). Thus, the transcriptional activity of TCF7L2 is likely to be altered between C57BL/6J and A/J mice in the midbrain neurons.
Taken together, snATAC-seq analysis of tissues from genetically different strains can guide the elucidation of cell-type-specific impact of trans-acting variants and suggests neuron-specific differences in the canonical Wnt signalling pathway between two commonly used inbred mouse strains.