Two sets of nine tissues derived from two phenotypically normal adult individuals, one female and one male, were provided by the NCI-funded Cooperative Human Tissue Network http://chtn.nci.nih.gov (CHTN-32364 and CHTN-32505, respectively).
Digestion, size-fractionation, and labeling of DNA samples
We used spleen as the reference in all tissue comparisons to keep the denominator for each clone's ratio roughly constant across our experiments. Spleen DNA was used as reference for the intra-individual tissue comparisons because it was the tissue for which DNA was most abundant for both donors. Phenol- and chloroform-extracted and then isopropanol-precipitated DNA derived from test and reference tissue samples were divided into tubes (20 to 60 μg per tube) and independently processed and analyzed. Each DNA sample was digested to completion with the methylation-sensitive restriction enzyme, Hpa II (New England Biolabs). We monitored completion of digestion by agarose gel electrophoresis (data not shown). The digested DNA was fractionated by size on 5% to 30% sucrose gradients by using a Beckman SW-40TI swing-out rotor as previously described . Fractions containing fragments smaller than 2.5 kb and greater than 80 bp as judged by gel electrophoresis were selected, pooled, and precipitated with isopropanol (Figure 1). The resulting DNA was analyzed by agarose gel electrophoresis to confirm that this process had effectively selected fragments 80 to 2,500 bp in length (data not shown). One microgram each of digested, size-selected test and reference DNA was differentially labeled with Cy3-dCTP (green) or Cy5-dCTP (red) (both GE Healthcare), respectively, by random priming using the Bioprime DNA Labeling System (Invitrogen).
Chromosome 1 tiling array hybridization
Preparation and validation of the human chromosome 1 genomic-clone microarray was previously described [55, 70]. Briefly, a total of 2,136 large-insert genomic clones were spotted in duplicate onto a glass slide. For simplicity, we will refer to this array as a 'BAC array', although the spotted clones include BACs, PACs, fosmids, and cosmids. Of these clones, 2,049 represent a tiling path covering 213 Mb, approximately 96% of the euchromatic regions of chromosome 1. Seventeen clones represent sparsely spaced segments of the X chromosome, with the first at 35 Mb and the last at 145 Mb with median spacing of approximately 6 Mb. These clones had been included on the array as controls for sex-mismatched conventional array CGH assays for other studies. The remaining 70 clones were included on the array as they were previously thought to map to chromosomes 1 or X, but are excluded from the results we report as their chromosomal mapping is now uncertain. Clone coordinates in the May 2004 assembly of the human genome sequence (Build 35) were provided by the Sanger Center and are available upon request. These coordinates typically imply clones longer than the spans shown on the NCBI or UCSC genome browsers due to trimming of sequences during genome assembly to eliminate overlapping sequence.
Labeled test and reference samples were mixed with 85 μg of Cot-1 DNA (Invitrogen) and hybridized to arrays under conditions described elsewhere . Each experiment was done in triplicate using aliquots of the DNA isolated from each tissue sample, but processed independently. The replicates were independently digested, size selected, labeled and hybridized, and paired with similarly processed spleen DNA from the same donor. Replicate arrays of a given tissue usually employed different independently processed replicates of the spleen reference DNA as well (see Additional file 1).
Array data analysis
Image acquisition was performed using an Axon 4000 B scanner and hybridization intensities were analyzed with the GenePixPro image analysis software (both Axon Instruments). For each spot, GenePixPro gave raw intensity values with the surrounding background subtracted for each wavelength scanned (635 nm and 532 nm for red and green channels, respectively). A custom R script, Normalization.r, was used to log2-transform these values, calculate the ratio of red to green fluorescence intensities, and apply a loess normalization procedure  based on GC content of the BACs (see below). A script then processed the output file to identify and eliminate clones whose duplicate spots reported significantly different normalized log2 ratios (that is, with standard deviation (SD) > 0.28) or had been flagged by GenePixPro as bad spots (that is, spots having less than 80% of pixels with intensities more than one SD above the background pixel intensity in either wavelength channel). We also eliminated results for subtelomeric clone RP5-857K21, as it appeared as an outlier on almost every array. The log2 ratios for the two spots for each remaining clone were averaged, and the median of all the clone averages was calculated and set to zero.
Our computer simulations of the method (see below and Results) indicated that relative fragment yield at a specified methylation level, as well as predicted intensity ratios for a pair of samples having different set methylation levels, vary with the GC content of the clones on the array. Therefore, we adjusted for GC content as follows: for each block of spots on each array, a loess curve was fitted to the relationship between the BACs' GC contents (in per cent) and their raw log2 ratios. The loess predicted value for each BAC's log2 ratio based on its GC content was subtracted from its real raw log ratio to give a normalized log ratio.
Using the GC-normalized data for each array, we then calculated the MAD for a contiguous set of 97 clones in 1q31–1q32 that contains relatively few Hpa II sites and showed little clone-to-clone variation in self-to-self comparisons and little deviation from the median in any of the sample comparisons. This group of clones was used in each array experiment as an internal measure of experimental noise to help distinguish potentially biologically meaningful deviations from experimental noise, which could vary from one array to another. This region of clones, whose midpoints are between positions 184,314,284 and 196,448,439 bp in chromosome 1's sequence in Build 35, is marked in each methylation profile with a bar marked 'MAD'. We used the MAD value as it is relatively robust to outliers and does not assume that values are normally distributed. The MAD value was calculated by subtracting the median log2 ratio for these selected clones from each clone's log2 ratio to give a deviation value; the MAD value is the median of the absolute values of these deviations. The SD of a normal distribution can be estimated to be 1.48 times the MAD value. We identified clones anywhere on chromosome 1 whose average log2 ratio values deviated below or above the overall median by >4.88 MAD units (that is, >3.29 times the SD), which would correspond to deviations with statistical significance at the predictive value of 0.001 based on a normal distribution with two-sided P value. In the self-to-self comparisons, we also used all clones to calculate the MAD and identify outliers beyond the P < 0.001 cutoff.
These outlying values, marked in each methylation profile, should include loci having true methylation differences between samples, since only 0.1% of the ratios drawn from a normal distribution (that is, two false-positive clones per array) are expected to deviate from the median by >4.88 MAD units. However, this threshold does not exclude all false positives due to experimental variation. The 97-clone region of 1q31–32 shows less experimental variation than the rest of the chromosome even in direct comparisons of the same tissue sample (for example, female medulla in Additional file 11). Only 0.5% of the ratios in this direct comparison deviate from the median by >4.88 MAD units when MAD is calculated over the entire array, whereas 2.2% do when MAD is calculated over the 97-clone region.
To control for potential genomic copy number differences between test and reference genomes, we also competitively hybridized 1 μg each of sonicated test and reference total DNA, mixed with 100 μg of Cot1 DNA (Invitrogen), to replicas of the same microarrays using the same conditions as described above . No significant deviations in copy number were observed across chromosome 1 in any tissue sample relative to the spleen sample from the same donor in these conventional array-based comparative genomic hybridization (array CGH) assays, except the female lung sample (data not shown). Because log2 ratios for the methylation-profiling array correlated very strongly with the wildly varying log2 ratios from the conventional CGH array for this sample (in isolation, each profile appeared to have large random noise), we eliminated this sample from further analysis. Genomic content abnormalities were noted for this lung sample in another study .
Note that it is not possible using this comparative method alone to determine what ratio represents an equivalent methylation state in test and reference DNA, as is possible with some sequencing-based methods (for example, [36, 49]). Thus, while a normalized log2 ratio of 0 (the median) might represent equivalent actual levels of methylation in a comparison of two normal tissue samples, it probably would not in a comparison of cells deficient in DNA methyltransferase activity and control cells.
In order to understand the response of our array to differently methylated genomes, we developed a PERL script, METHBATCHSIM, to simulate methylation states for a sequence of interest, such as a particular clone on the array, and the corresponding yield of fragments through our methylation-profiling procedure. First, we generated a list of Hpa II restriction sites on the sequence and used RepeatMasker  to generate a repeat-masked sequence file. METHBATCHSIM reads in the Hpa II and masked-sequence files and randomly designates which restriction sites in each clone contain methylated CpGs and which are unmethylated, for a specified overall methylation proportion. Specifically, we used the rbinom function of the R package  to randomly assign methylation status of each site, using the desired overall methylation fraction (between 0% and 100%, in steps of 1%) as the 'prob' parameter, the number of restriction sites in the BAC as the 'n' parameter, and the 'size' parameter set at 1. The sequence is then virtually digested with Hpa II based on the assigned methylation states, and we determine which of the resulting fragments are within the specified size range (80 to 2,500 bp). We simulated labeling by random priming by summing the total unmasked base pairs in all fragments meeting the size-range criterion. The final output summarizes these totals, averaged over 1,000 simulation runs. We repeated this simulation using the sequence of each clone on the array.
Other bioinformatic analyses
CpG island coordinates were taken from the UCSC Genome Browser http://genome.ucsc.edu CpG Island track, where CpG islands are defined as regions of >200 bp with GC content ≥ 50% and a ratio >0.6 of observed number of CG dinucleotides to expected number on the basis of the numbers of Gs and Cs in the segment. Gene coordinates were obtained from the UCSC's RefSeq track.
In order to test whether the relative methylation levels we measured relate to transcription levels, we compared our results with expression data obtained by Ge et al. , who hybridized RNA from each of 36 normal human tissues singly to Affymetrix oligonucleotide microarrays to determine expression levels of approximately 20,000 human genes. No RNA was available to study expression levels in samples from the same donors used for our methylation studies, but because methylation array ratios were highly correlated between the two donors we studied, it seemed reasonable to assume that methylation states would be similar in the tissues of donors used by Ge et al. and, therefore, reasonable to think that if any consistent correlation between regional methylation and expression levels exists, it might be observed even if different tissue donors are used for the methylation and expression arrays. We averaged methylation array ratios for the same tissue comparisons (that is, across sets of six arrays for each of cerebellum, heart and liver, and across sets of three arrays for each of lung, testis and ovary, compared with spleen in all cases). We used these cross-array averages to recalculate the MAD-based thresholds used to classify BACs as having outlying methylation array values.
Starting with Ge et al.'s raw probe-level expression-array data (GEO series GSE2361), we applied standard background adjustment, normalization and log-transformation methods using the robust multi-array average algorithm from Bioconductor's affy package  to obtain expression levels for each probe set (a probe set is a set of around 20 oligonucleotide probes that together represent a portion of a single gene). For each of the six tissue comparisons made (Figure 7, Additional file 10), we combined normalized expression results for the relevant tissue types to obtain log2 expression ratios. A table of Affymetrix probe set names, their corresponding gene symbols, and genomic coordinates of those genes in Build 35 of the human genome assembly was obtained from the SCGAP Hematopoietic Stem Cells Project . We then plotted the expression ratios of probe sets corresponding to genes whose 5' ends mapped in BACs of each three implied categories of implied relative methylation levels: 'High', 'Mid' or 'Low'. These conservative classifications correspond to BACs with ratios on the methylation-sensitive arrays that exceed MAD-based thresholds discussed above (that is, for Figure 7, 4.88 MAD units from the median, a threshold chosen such that we would expect approximately 0.05% of BACs on the array to fall in each of the Low or High categories by chance). In each boxplot, the median is indicated by the thick line, the box spans the middle two quartiles of the distribution, the small circles are outliers, and the whiskers extend to the most extreme data point which is no more than 1.5 times the length of the box away from the box.
In order to test the statistical significance of the observed differences in relative expression ratios between genes in BACs of different methylation ratio categories, we needed to account for the many-to-many relationships that exist between BACs, genes and probe sets. A BAC can contain more than one gene, a gene's 5' end can be present in more than one overlapping BAC, and some genes are represented on the expression array by more than one probe set (located in different parts of the gene). We therefore fitted generalized estimating equations to the data (using R's geeglm function from the geepack package ), treating datapoints from the same BAC as groups. We also gave each BAC-probe set datapoint a weight equal to the reciprocal of the number of times its gene symbol appeared in the table of all BAC-probe set pairs (that is, a gene represented by three probe sets and whose 5' end is in two overlapping BACs would receive a weight of 1/6 for each of its six datapoints). For each tissue comparison, we performed two statistical tests: (a) a test of whether expression ratios differ between BACs in the 'High' and 'Low' methylation categories (P values on square brackets, Figure 7 and Additional file 10); (b) a trend test for whether expression ratios are linearly related to methylation category, when methylation levels are coded as -1 (for low array ratio and 'High' implied relative methylation level), 0 ('Mid'), or 1 (for high array ratio and 'Low' implied relative methylation level) (P values on arrows, Figure 7 and Additional file 10). In each test, we used the 'Gaussian' family to model variation in expression ratios and 'independence' as the correlation structure.
PCR, cloning and sequencing of bisulfite-modified DNA
Total genomic DNA was extracted from male heart and spleen samples, and unmethylated cytosines were converted to uracil by bisulfite treatment according to published protocols . Briefly, each DNA sample was first digested with Bam H1, the enzyme was removed by phenol and chloroform extraction, and the DNA was subsequently ethanol precipitated and resuspended in H2O. NaOH was added to each DNA sample to a final concentration of 0.3 M, and the DNA was denatured for 2 min at 97°C followed by incubation for 30 min at 39°C. A solution of sodium bisulfite and hydroquinone was immediately added to 3.3 M and 0.67 mM final concentrations, respectively, and samples were incubated for 16 h at 55°C with 95°C spikes for 5 min every 3 h. Samples were desalted using QIAquick PCR Purification columns (Qiagen) according to the manufacturer's instructions. NaOH was added to each converted sample to a final concentration of 0.3 M, and the samples were incubated at 37°C for 15 min to remove excess sodium bisulfite. DNA samples were then ethanol precipitated and resuspended in 100 μl H2O.
Primers for selected regions were designed using the BiSearch Web Server  and are provided upon request. Each amplification was done with touchdown PCR using 2 μl Ex-Taq enzyme (Takara Mirus Bio) and 0.5 μl of 50 μM primers in a total volume of 25 μl with the following conditions: 95°C for 2 min, 30 cycles of 95°C for 30s, 66°C for 30s (stepping down 2°C every 2 cycles until 56°C), and 72°C for 1 min, followed by 72°C for 10 min.
Amplified products from converted DNA were gel-purified with QIAquick Gel Extraction columns (Qiagen) according to the manufacturer's instructions and subsequently cloned into the pCR2.1 plasmid vector using the TOPO TA Cloning Kit (Invitrogen). Clones with inserts were identified by PCR amplification using M13 reverse and T7 forward primers using TOPO cloning instructions. Amplified M13-T7 products were purified with Sephacryl S-300 (Amersham BioSciences). Seven to thirty-two clones were then sequenced successfully from each PCR product using the same primers on an ABI 3730 (Applied Biosystems). Sequences were aligned and analyzed using PhredPhrap and Consed [80–82].
All array data relevant to this manuscript have been submitted to the Gene Expression Omnibus (GEO) under accession number GSE12925.