Cost-effective DNA methylation profiling by FML-seq
The FML-seq method
The protocol for FML-seq comprises only three steps (Fig 1A). First, genomic DNA (gDNA) is digested by a methylation-dependent restriction endonuclease that cuts at a certain distance from the 5-methylcytosine or 5-hydroxymethylcytosine in its motif and leaves a short overhang (10). Second, a master mix is added with combined reagents for sticky-end adapter ligation, preparation of the specially designed adapters (15), and indexing PCR. Finally, a single cleanup without size selection is sufficient to purify the library, because the digestion does not produce unusably short fragments (Fig S1) and the adapter design prevents byproducts without gDNA inserts (Fig S2). The resulting library contains unaltered genome sequences alignable by standard pipelines. Each end of a library fragment is derived from a methylation-dependent digestion, so paired-end sequencing detects two methylated cytosine positions per fragment (Fig 1B). FML-seq represents a substantial simplification of previous protocols based on methylation-dependent digestion.
(A) Library preparation reactions. Genomic DNA is digested by a methylation-dependent restriction endonuclease that cuts at a known distance from the methylated cytosine in its motif and leaves a short single-strand overhang of unknown bases. Stem-loop (hairpin) sequencing adapters with complementary random overhangs are ligated to the digested genomic DNA fragments, but the phosphodiester backbone is completed only on one strand because the adapters lack a 5′ phosphate. The resulting single-strand nick is extended by DNA polymerase to fill in a second strand complementary to the adapter’s loop, whereas the unneeded stem strand is degraded. This library of genomic DNA inserts between double-stranded linear short adapters is then amplified by standard polymerase chain reaction with long indexing primers to produce a sequencing-ready library. A standard solid-phase reversible immobilization bead cleanup without size selection is sufficient to purify the library. Paired-end sequencing reads imply the location of the two methylated cytosines resulting in each observed fragment. (B) Counting fragmentation at methylated loci and sequencing fragments as hits at methylated motif sites. The restriction endonuclease used here, MspJI, cuts at the motif mCNNR. Each copy of this motif on either strand implies a potential cut site at a certain distance past its 3′ end. When paired-end sequence reads are aligned to the reference genome, each end of a sequenced fragment counts as one hit for the corresponding motif site; for example, the fragment marked by an asterisk tallies one hit each for the red and green motif sites. The number of hits for a given motif site corresponds to the fraction of genome copies methylated at that motif’s cytosine position.
(A) The enzyme’s recognition domain (blue) binds a motif (blue) containing methylated cytosine (red), mCNNR, whereas its endonuclease domain (magenta) cuts at distances of 13 and 17 bp from the methylated cytosine position. Digestion leaves a 5′ overhang of 4 nt with a terminal phosphate, and terminal hydroxyl on the 3′ underhang. (B) Digestion by MspJI cannot result in very short fragments. The shortest theoretically possible library insert would be generated by recognition of a mCNNR site immediately flanking the 4 nt overhang resulting from a previous digestion, leaving a total insert of 21 bp after ligating the adapters. In reality, MspJI may not necessarily be able to bind its motif without additional paired bases on both sides. Empirically, we observe, in high-input libraries without size selection, only about 0.04% of library inserts shorter than 22 bp, 0.05% shorter than 25 bp, 0.33% shorter than 30 bp (Fig S7A). (C) In the special case of a fully methylated CpG within a YNCGNR palindrome, two MspJI enzymes may digest the DNA symmetrically and produce a 32 bp-insert. In fragmentation at methylated loci and sequencing libraries from human genomic DNA, inserts of this length are disproportionately common but not the majority, about 4% of all inserts (Fig S7A).
For convenience a palindromic YNCGNR site with two complementary CGNR motif sites is shown, though fragments can be produced by a non-palindromic single site and a given fragment does not necessarily contain the motif sites responsible for its digestion. The stem-loop (hairpin) adapters are an equimolar mix of two versions, one for each end of a sequencing-ready library molecule (Illumina P5 and P7), but except for their slightly different sequences, they are used identically and the ligation product has no polarity. By chance, half of the digestion products will be ligated between two of the same adapter (P5 and P5 or P7 and P7), producing an unsequenceable molecule that is unlikely to amplify because of PCR suppression; this limits the efficiency of the library synthesis to 50%. Nick extension replaces the second adapter strand, whereas uracil-DNA glycosylase excises uracil and the abasic sites are hydrolyzed, leaving an amplifiable library molecule. Because the stem-loop adapters lack 5′ phosphate, they ligate to a genomic DNA fragment on only one strand, and if two adapters anneal without a genomic DNA insert, the nick extension stops at the nick on the opposite strand. Oligonucleotide sequences @ 2021 Illumina, Inc. All rights reserved. Derivative works created by Illumina customers are authorized for use with Illumina instruments and products only. All other uses are strictly prohibited.
FML-seq uses sequencing adapters with 5′ overhangs of 4 random bases (4N) for efficient sticky-end ligation to the corresponding overhangs of unknown bases resulting from digestion by a restriction endonuclease whose motif includes methylated cytosine. This would be expected to result in an overwhelming byproduct from dimerization of adapters that ligate directly to each other without a gDNA insert. FML-seq’s library protocol uses a combination of several techniques to prevent this behavior: (1) The adapters are added to the gDNA before digestion. After digestion there is a heat denaturation of the restriction endonuclease, and at this step, adapters that annealed to each other during storage at higher concentration may melt apart and reanneal to the ends of new gDNA fragments instead. (2) The adapters lack a phosphate at the 5′ terminus, which is required for DNA ligase to connect that terminus to a matching 3′ hydroxyl. Thus, neither strand in an adapter dimer can be ligated, and even if two adapters with complementary overhangs temporarily anneal, 4 bp of hydrogen bonding may not keep them together when the temperature is raised in subsequent steps. (3) Before PCR, the same polymerase is used to fill in the second strand of the adapter by extension from the nick where the adapter’s 5′ end is not joined to the gDNA insert’s 3′ end. Even if an adapter dimer holds together at the PCR polymerase’s extension temperature, it will also have a nick on the opposite strand, so the polymerase would have to jump over a gap in the template strand to fill in the new strand. (4) The 5′ end of the stem loop (hairpin) adapter has the 4N sticky-end overhang and a complementary sequence that forms the double-stranded stem. In addition to PCR suppression, this stem could also cause adapter dimerization by melting and reannealing to a different molecule. However, in the stem sequence, all thymine bases are replaced with uracil, and a dU-intolerant proofreading polymerase is used to prevent the stem sequence from being replicated. (5) Furthermore, the stem sequence’s uracils are excised before PCR by uracil–DNA glycosylase (UDG), leaving abasic sites that further deter replication and may be fully destroyed by hydrolysis at PCR denaturation temperature. In this protocol, UDG from a hyperthermophile (Archaeglobus fulgidus) is included in the combined ligation/loop-breaking/PCR master mix, because the traditional Escherichia coli UDG interferes with the low-temperature ligation and would need to be added in an additional step between ligation and PCR, whereas the hyperthermophilic UDG appears inert at ligation temperature. (6) For the Illumina sequencing platform, the adapters are based on the less traditional but equally well-supported Nextera sequence (Tn5 transposon) rather than standard TruSeq, because that sequence is T-rich on the 5′ strand and therefore this protocol’s adapters are U-rich in the portion removed by UDG. (7) These adapter sequences are incomplete and require long PCR primers to extend them to full length with multiplexing indexes. Like the Nextera design, the PCR primers are also incomplete and only partially overlap the ligated adapter sequences. Thus, both an adapter and a PCR primer are required to create a full-length product matching both the flowcell’s amplification primers and the sequencing primers, so neither the original adapters nor the PCR primers can form sequenceable byproducts alone.
An additional consequence of destroying the 5′ ends of the original adapters is that the random overhang is also destroyed; even if it contains a mismatch to the complementary gDNA sequence that is tolerated by the ligase, the mismatched base is not part of the final molecule’s sequence. The final sequence is derived only from the gDNA and the random overhang does not contribute mismatches at the ends of sequence reads.
Previous methods that used MspJI or another methylation-specific restriction endonuclease focused exclusively on the 32-bp fragment produced by two complete, symmetric digestions of a fully methylated CpG (Fig S1C). This requires a precise size selection such as cutting a band out of an acrylamide electrophoresis gel, and discards about 96% of the library product (see Fig S7). In contrast, FML-seq treats every cut site as informative of a methylated base, even where digestion is incomplete or the sequence motif is not palindromic.
Specificity of FML-seq
FML-seq’s specificity for methylated cytosine depends on the endonuclease. Concordant with previous reports (10), digestion of unmethylated gDNA from lambda bacteriophage yielded no detectable library product, unless greatly overamplified (Fig S3A). Reads from methylated gDNA aligned largely at the sequence motif targeted by Dcm methylase (16) as expected; reads from unmethylated gDNA had little enrichment for the methylase’s motif but still aligned at the endonuclease’s motif (10) (Fig S3B).
(A) Electropherograms of fragmentation at methylated loci and sequencing libraries prepared from 60 ng methylated or unmethylated lambda bacteriophage genomic DNA with varying numbers of PCR cycles, four replicates per condition. The optimized protocol uses 15 cycles for this amount of methylated genomic DNA, but unmethylated DNA requires about 24 cycles to produce a similar yield. Note: with more than 15 PCR cycles, the overamplified libraries from methylated genomic DNA are converted into slow-migrating artifacts that appear to the right of the upper marker, giving the appearance of low yield in the measurable length range. (B) Alignment of sequence reads at the expected distance from motif sites. Methylation by the host’s Dcm is expected at the CmCWGG motif; however, MspJI cuts at any mCNNR.
The other factor in FML-seq’s specificity is whether other kinds of gDNA fragmentation produce ligatable ends. To fit the adapters, a fragment must have a 5′ overhang of 4 nt with a terminal phosphate, which is produced by the endonuclease but unlikely to result from other kinds of fragmentation. Accordingly, no-endonuclease control libraries using gDNA from either fresh cells or degraded archival tissue showed no detectable library product unless greatly overamplified (Fig S4).
Electropherograms of fragmentation at methylated loci and sequencing libraries prepared from 60 ng intact human gDNA from fresh cells or degraded human gDNA from formalin-fixed, paraffin-embedded tissue, using either methylation-specific digestion by MspJI restriction endonuclease or a no-endonuclease control, two replicates per condition. The no-endonuclease control shows no library yield unless the number of PCR cycles is greatly increased, at which point, its yield is similar to a no-DNA control (compare Fig S5C). Note: as in Fig S3A, the yields of libraries from the MspJI condition appear to decrease with high PCR cycles because of overamplification artifacts; furthermore, undigested gDNA fragments from the formalin-fixed, paraffin-embedded sample are visible near and beyond the upper marker because of their short length, even where no library is detectable.
Biological validation
To compare FML-seq with other methods, we prepared libraries from four well-studied human cell lines (Fig S5). The sequence reads from human gDNA had high alignability to the reference genome (Fig S6A). Insert lengths varied widely beyond the canonical 32-bp semipalindromic fragment (Fig S1C) (10), but very few inserts were too short to align (Fig S7). Most reads aligned at the expected sequence motif (Fig S6B) (10).
(A) Electropherograms of libraries from 60 ng gDNA, two replicates per cell line. (B) Molar length distributions on a linear axis, normalized to equal totals within the graphing window. Insert length is calculated by subtracting the combined length of the indexed sequencing adapters, 136 bp, from the molecule length. The TapeStation lacks sufficient resolution to show the expected peak of 32 bp inserts (168 bp molecules). (C) Libraries from serial-dilution gDNA samples and no-DNA control.
(A) Alignability of read sequences to the reference genome. “Adapter dimers” are inserts of 10 bp or less and “too short” reads are longer than adapter dimers but shorter than bwa-mem2’s minimum seed length of 19. “Confidently” and “poorly aligned” are uniquely aligned reads above or below the minimum MAPQ of 10 (posterior probability of correct alignment 0.9). (B) Alignment of sequence reads at the expected distance from CGNR motif sites.
Lengths of “all read pairs” are inferred by adapter trimming, but this is limited to inserts shorter than the sequence read length, 154 nt (untrimmed reads are not shown). Lengths of “confidently aligned” read pairs (MAPQ ≥ 10) are inferred by aligned positions in the reference genome. For direct comparison, the denominator of both fractions is the total number of post-filter sequenced read pairs, that is, “all read pairs” sum to 100% but “confidently aligned” read pairs sum to the proportion of confidently alignable read pairs in each library, and for any given insert length below the read length, the relative height of the “confidently aligned” bar reflects the proportion of “all read pairs” at that length. There are prominent peaks at 4 bp from adapter dimers (no insert of genomic DNA [gDNA] sequence); at 32 bp from symmetrically digested, fully methylated CpG sites; and at 151 bp because of the three-base minimum for detection of adapter sequence. (A) All tested cell lines, 60 ng gDNA each. (B) Serial dilutions of HeLa-S3 and IMR-90. (C) No-template controls (no gDNA).
At a variety of functional genome elements FML-seq signal was unimodal and log-distributed, unlike measurements from WGBS which tended toward the extremes of 0% and 100% methylation, and the greatest dynamic range of FML-seq signal was among gene promoters (Fig S8). Promoters were rich in restriction motif sites despite their short lengths (median 15 sites, 97% with at least 1 site; Fig S9), but promoters with few motif sites had a higher density of extreme FML-seq scores, measured as reads per million per restriction motif (Fig S10). FML-seq signal correlated well with DNA methylation signals detected by other methods (Fig 2A and Table S1) (17, 18): the two-channel base-conversion methods (EPIC, WGBS, RRBS) were most similar to one another, but FML-seq was more similar to them than another one-channel method (MeDIP-seq). FML-seq showed lower methylation than other methods for HeLa-S3 in particular, as widespread hypermethylation diluted the one-channel signal, though such extreme genome dysregulation may not be observed in many experiments. FML-seq’s relative signal per promoter (Fig 2B) and estimated fold change (Fig 2C) were correlated with the most comprehensive method, WGBS to a similar degree as the EPIC microarray was correlated with WGBS, though FML-seq’s signal included more zeroes. Testing for differential methylation between biological conditions, the list of differentially methylated promoters detected by FML-seq showed concordance with WGBS and greater sensitivity than RRBS (Fig 2D). These results at the level of functional genome elements stood in contrast to analysis at the level of individual cytosine positions, where sequence read counts were too low (65% of single-position counts were zero) to correspond with measurements from base-conversion methods (Fig S12).
(A) Density of sequence reads from fragmentation at methylated loci and sequencing assigned to CGNR sites within the genome elements. Here, reads per million per motif is normalized against the total number of reads assigned to all CGNR sites, rather than the sum within only a given category of genome elements, to compare the different categories despite covering widely different proportions of the genome (286,435 exons, 297,593 introns, 40,351 promoters, 1,402 enhancers). (B) Methylation of CpG sites within the genome elements according to whole-genome bisulfite sequencing.
(A) Lengths of promoter regions in the reference genome. (B) Number of targets for each DNA methylation profiling method per promoter (CG: canonical human DNA methylation; CGNR: the MspJI restriction endonuclease used to target methylated cytosine in fragmentation at methylated loci and sequencing; CCGG: the MspI restriction endonuclease used to reduce the number of targets in reduced-representation bisulfite sequencing). Both copies of palindromic motifs CG and CCGG are counted unless the potentially methylated CpG straddles the boundary of the promoter, resulting in counts that are nearly always even. The promoters are enriched for CpG dinucleotides (average 6.5 per 100 bp), relative to the entire reference genome (1.1 per 100 bp). (C) Fragmentation at methylated loci and sequencing read counts per promoter in a representative sample, K562 replicate 1. The 1,126 promoters without CGNR motif sites are included as zero counts.
Methylated DNA immunoprecipitation data were not available for two cell lines. (A) Normalized methylation signal (blue: EPIC β, whole-genome bisulfite sequencing [WGBS] and reduced-representation bisulfite sequencing percent methylation, Methylated DNA immunoprecipitation VSD, FML-seq log reads per million per motif) at the 500 promoters with the most variation among cell lines according to WGBS. Technical replicates are shown separately. (B) Methylation scores by promoter for K562 in FML-seq or EPIC versus WGBS, replicates pooled. Only 200 randomly sampled promoters are graphed and used for the blue least-squares regression line but Spearman’s ρ is calculated with all promoters. See Fig S11 for a visualization of all promoters. (C) Methylation fold change by promoter for K562 versus HeLa-S3 in FML-seq or EPIC versus WGBS, replicates pooled. (B) The same subset of promoters are graphed as in (B). The red line shows y = x. (D) Overlap of significantly (BH-adjusted P < 0.01) differentially methylated promoters between K562 and HeLa-S3 according to WGBS (24,240 significant), reduced-representation bisulfite sequencing (6,929), and FML-seq (14,023).
(A) Methylation scores in K562. (B) Fold changes for K562 versus HeLa-S3. Results are calculated from all promoters. The red line shows y = x.
Only cytosines within CpG dinucleotides are considered overall and only the first cytosine of the CGNR motif is considered for FML-seq read counts. Only chromosome 22, the shortest autosome, is considered to reduce the scale of the analysis (1,686,636 cytosine positions, 874,441 in CGNR). (A) FML-seq read counts per cytosine position in a rep- resentative sample, K562 replicate 1. (B) Different platforms’ methylation scores in K562. (C) Different platforms’ fold changes for K562 versus HeLa-S3, at all cytosine positions with CpG dinucleotides on chromosome 22, the shortest autosome. The red line shows y = x.
Minimal sequencing conditions
Serial dilutions from 60 ng gDNA input showed successful but lower-quality results down to 6 ng (Fig S14). Because FML-seq analysis counts entire sequence reads rather than bases within each read, long reads are no more useful than short reads as long as they can be confidently aligned to the reference genome; the shortest available read lengths proved sufficient (Fig S13). Given the diminishing returns of additional sequencing depth (Fig S14), sufficient sequencing for a human gDNA sample is roughly 40 million read pairs or 96 libraries per NovaSeq S2 flow cell. Thus, although sequencing is the main cost of the FML-seq workflow at current prices, that expense is also more economical than other approaches.
Reads shown are from the MiSeq v2 sequencing kit at 2 × 154 nt. Confident alignment is MAPQ ≥ 10. Commonly available read lengths are marked: 2 × 38, 2 × 51, 2 × 76, 2 × 101, 2 × 151.
Lower sequencing depths were simulated by subsampling the full data. (A) Promoters with at least 1 fragmentation at methylated loci and sequencing read attributed to a CGNR motif site within the promoter boundaries. (B) Pearson correlation of log reads per million per motif across all promoters. (C) Spearman correlation of pooled log reads per million per motif with whole-genome bisulfite sequencing pooled percent methylation from the same cell line across all promoters. (D) Silhouette metric of cluster distance (Pearson distance 1 − r to the nearest member of another class ÷ mean distance within the same class), considering only the two cell types used in serial dilutions, HeLa-S3 and IMR-90, for fair comparison. (E) Promoters with significantly differential methylation (DESeq2 BH-adjusted P < 0.01) between HeLa-S3 and IMR-90. The number of promoters with nonzero read counts or statistical significance inevitably increases with total sequencing depth as smaller methylation signals and effect sizes become detectable; however, replicate correlations and silhouettes indicate the total amount of usable information in the sequencing library is reached more quickly as additional depth yields only duplicate reads.
Legal Disclaimer:
EIN Presswire provides this news content "as is" without warranty of any kind. We do not accept any responsibility or liability for the accuracy, content, images, videos, licenses, completeness, legality, or reliability of the information contained in this article. If you have any complaints or copyright issues related to this article, kindly contact the author above.