Identification of DNA N6-methyladenine sites by integration of sequence features

Background An increasing number of nucleic acid modifications have been profiled with the development of sequencing technologies. DNA N6-methyladenine (6mA), which is a prevalent epigenetic modification, plays important roles in a series of biological processes. So far, identification of DNA 6mA relies primarily on time-consuming and expensive experimental approaches. However, in silico methods can be implemented to conduct preliminary screening to save experimental resources and time, especially given the rapid accumulation of sequencing data. Results In this study, we constructed a 6mA predictor, p6mA, from a series of sequence-based features, including physicochemical properties, position-specific triple-nucleotide propensity (PSTNP), and electron–ion interaction pseudopotential (EIIP). We performed maximum relevance maximum distance (MRMD) analysis to select key features and used the Extreme Gradient Boosting (XGBoost) algorithm to build our predictor. Results demonstrated that p6mA outperformed other existing predictors using different datasets. Conclusions p6mA can predict the methylation status of DNA adenines, using only sequence files. It may be used as a tool to help the study of 6mA distribution pattern. Users can download it from https://github.com/Konglab404/p6mA.


Background
DNA N 6 -methyladenine (6mA) is an important epigenetic modification of nucleic acid, firstly characterized in bacteria [1]. In contrast to 5mC, 6mA remains poorly studied and was previously thought to only occur in prokaryotes [2]. Accumulating evidences, however, has confirmed that it also exists in eukaryotes, including zoological and botanical species (e.g., Arabidopsis thaliana, Mus musculus, Danio rerio, and Sus scrofa). Recently, two studies found that DNA 6mA sites also exist extensively in the genomes of humans [3] and rice [4], thus deepening our understanding of this modification in high-grade organisms.
DNA 6mA plays important roles in various biological processes, such as the restriction-modification system [5,6], DNA replication and repair [7,8], nucleoid segregation [9,10], and transcription [11]. To detect DNA 6mA modification, a series of experimental methods have been developed, such as methylated DNA immunoprecipitation sequencing [12], liquid chromatograph-tandem mass spectrometry [3], capillary electrophoresis and laser-induced fluorescence [13], and single-molecule real-time sequencing (SMRT-seq) [14]. However, these experimental procedures are expensive and time-consuming and thus largely limited its application in DNA 6mA study, urging the necessity for the development of bioinformatics-based approaches to predict methylated adenine sites in genomes.
Machine learning builds models by handling features to perform specific tasks and has been widely applied in biological issues, including post-transcription RNA

Open Access
Epigenetics & Chromatin  [15][16][17], promoter discovery [18][19][20], and nucleotide modification prediction [21][22][23]. The occurrence of 6mA relies on the properties of its surrounding sequences, which play vital roles in methyltransferase/ demethylase-dependent catalytic processes [3,24,25]. Recently, some machine learning-based predictors, e.g., iDNA6mA-PseKNC [26] and i6mA-Pred [27], were developed to identify 6mA sites at the genomic level. The former was trained with mouse data and achieved a high recall ratio in several datasets, whereas the latter was designed to predict 6mA sites in rice. To the best of our knowledge, however, there is no 6mA predictor trained on multi-species data.
In this study, we constructed a predictor, p6mA, to identify DNA 6mA sites by sequence-based features. The predictor was trained on dataset from four species: i.e., Oryza sativa (rice), Drosophila melanogaster (fruit fly), Caenorhabditis elegans (worm), and Homo sapiens (human). The DNA sequences were transformed into numeric vectors by extracting 172 features. We selected key features using the maximum relevance maximum distance (MRMD) method [28] and constructed the predictor using the Extreme Gradient Boosting (XGBoost) algorithm [29]. Comparison with other existing tools demonstrated that p6mA outperformed other methods in several aspects. Users can download p6mA from https ://githu b.com/Kongl ab404 /p6mA.

Nucleotide composition and conservation analysis
In this study, we constructed an aggregated benchmark dataset by four species' data. There are 3040 positive samples and 3040 negative samples (Table 1). All the samples are 41 nt long with an adenosine (A) in the center. We adopted Two Sample Logos [30] to visualize significantly overrepresented and underrepresented sites with a threshold of p < 0.05. The nucleotide enrichment status, as shown in Fig. 1a, showed that there exists nucleotide distribution bias between 6mA and non-6mA containing sequences. For example, in 6mA-containing sequences, GAGG motif was enriched in center and adenosine was enriched in the + 4 nt position. The above results indicated that the surrounding nucleotide composition information can be adopted to discriminate 6mA and non-6mA sites.
Next, we investigated whether sequence bias can cause differences in conservatism. We then performed entropy analysis, aiming to determine trinucleotide-positioned conservatism differences between 6mA and non-6mA sites [31]. The entropy of trinucleotides at each position was calculated as follows: where n denotes the total number of trinucleotide combinations in the ith position and p(3mer j |i) denotes the frequency of the jth trinucleotide at the ith position in the positive/negative samples. Two 39-dimensional numerical vectors were generated to express the entropy values at positions of positive and negative samples.
Information entropy was used to evaluate chaos in the signal processing field and help to reflect conservatism [31]. A lower entropy value, which means less chaos, indicates that the site concerned is more conserved. The Two Sample Logos and entropy analysis results both supported that positioned nucleotide information is able to discriminate between 6mA and non-6mA sites, thus providing a reasonable basis for the application of positioned sequence feature extraction methods like PSTNP.

Feature selection and parameter tuning
We used three methods (i.e., PSTNP, EIIP, and physicochemical properties) to extract features. Each sample was transformed into a 172-dimensional numerical vector, though the feature set also included redundant features. To reduce computational resource waste, we used MRMD score, an index positively related to feature importance, and incremental feature selection (IFS) to select optimal feature sets for each dataset. Features were ranked by MRMD score from highest to lowest. The features from the ranked list were then added one-by-one to a new set and used to construct an XGBoost-based model with default parameters. Model performance was evaluated by tenfold cross-validation and the feature set with highest accuracy was chosen as the optimal set. As shown in Fig. 2a, the highest accuracy (82.47%) was obtained when the optimal 124 features were included. Therefore, we trained the model by its top-ranked 124 features. We then trained models for each dataset using their optimal feature sets. To obtain better performance, a grid search strategy was used to conduct model tuning. Three parameters (i.e., gamma, eta, and max_depth) of XGBoost were optimized in the spaces [0, 0.2], [0.1, 0.5], and [2,10] with steps of 0.1, 0.1, and 2, respectively. We trained 525 models and the parameter set with the highest tenfold cross-validation accuracy was chosen as the parameter set of the model. Accuracy scatter plots of the model based on different parameter combinations are shown in Fig. 2b. Results demonstrated that the optimal parameter set was gamma = 0.16, eta = 0.3, and max_ depth = 4. Thus, we trained the model using the three optimal parameters. Then we performed jackknife test to evaluate its performance and the accuracy is 82.04%. Accordingly, a predictor named p6mA was implemented.

Comparison with existing predictors
To evaluate the prediction performance of p6mA, we compared it with three existing predictors, i.e., iDNA6mA-Pred [27], iDNA6mA-PseKNC [26] and MM-6mAPred [32]. The jackknife test result was applied to measure the predictive power of our methods.
As shown in Fig. 3a, p6mA performed better than the three predictors, it obtained the highest values among the four metrics (i.e., sensitivity, specificity, accuracy, and Matthews Correlation Coefficient). MM-6mAPred has the second highest accuracy (Acc) and second highest Matthews Correlation Coefficient (MCC), while its sensitivity (Sn) is 70.46%, which is ~ 10% lower than that of p6mA. iDNA6mA-PseKNC's sensitivity is 77.27%, while its specificity is only 5.95%. i6mA-Pred obtains specificity of 82.37%, while the sensitivity is 64.31%. The details of the performances can be found in Additional file 1: Table S1.
MM-6mAPred provides the prediction score for each sample, so we plotted its receiver operating characteristic (ROC) curve and compared it with ROC curve of p6mA (Fig. 3b). The area under the ROC curve (auROC) were

Independent validation on A. thaliana dataset
We then performed independent validation on a dataset from another species. As a vital model flowering plant, A. thaliana is a good species to test our predictor, with its N 6 -methyladenine modification landscape previously reported in 2015 [33]. The modification data of A. thaliana were downloaded from MethSMRT and a dataset for independent validation was constructed. We obtained 1055 non-redundant positive samples and 1055 nonredundant negative samples from the reference genome TAIR10. The dataset construction method was similar to the benchmark dataset.
Overall, we demonstrated the robustness of p6mA and its superiority over other existing methods by independent validation. The details of the performances can be found in Additional file 1: Table S2.

Software package introduction
To facilitate the application of our predictor, we implemented p6mA in R language, with the code stored in GitHub (https ://githu b.com/Kongl ab404 /p6mA). The feature extractor methods were also implemented, users just need to provide the input data in fasta format files. Each sequence of the input file should be a 41-bp-length sequence and the center position (e.g., the 21th nucleotide) is the A (adenine) for predict, like: > human_seq_1101_hg19 ATA GTG TAG TGA GCG TAC GTA ACG TGA AGT GAG TGA GTAGC.
The output file of p6mA is a text file containing sequence names, scores, and predicted modification status. The detailed usage and installation guide and the example input/output files can also be found on the repository page.

Discussion
In this study, we built a 6mA predictor p6mA and showed that it is a more robust and competitive 6mA predictor than other existing ones, as determined using benchmark dataset and independent validation. We not only developed a convenient tool for predicting 6mA, but also indicated the portability of the position-based feature extraction method in biological subjects, especially in nucleotide modification prediction. Besides, recent research showed that different species may have different 6mA preference motifs, e.g., the AGAAT motif of C. elegans [34]. This phenomenon prompts us that it is necessary to build 6mA predictors by multiple species' data.
In the independent test by A. thaliana data, although p6mA obtained a higher auROC than that of MM-6mAPred, the ROC cures displayed that the specificity of p6mA needs to be improved. Due to the rarity of modified nucleotides in genome, some rare category exploration methods, e.g., RCLens [35], could also be adopted into uncommon nucleic acid modification prediction problems in the future.
As an epigenetic modification, methylation of adenine is a complex biological process that may be affected by other factors, such as chromatin topology and cytoplasmic physicochemical properties. Therefore, we will incorporate additional information to improve the performance of the discrimination ability in the future and different feature extraction methods may be used to construct a more powerful epigenetic modification predictor.

Conclusions
In summary, we developed a new bioinformatics tool, p6mA, for predicting 6mA-modified sites. We also implemented the predictor as an R software package for ease of use. p6mA was built by multiple species' data and it may help the investigation of 6mA modification pattern in different species' genomes.

Benchmark dataset construction
Fruit fly and worm 6mA-positive samples were obtained from the MethSMRT database [36] and human 6mApositive samples were obtained from the HuaXia1 assembly [37]. To construct a high-quality dataset, 6mA sites with identification Qv scores, which represent the confidence level of a modification, of less than 30 (p-values < 0.001) were filtered. We extracted ± 20 nt sequences from the 6mA sites for each sample to a final sequence length of 41 nt. To reduce sequence-homology bias, CD-HIT v4.6.8 [38] was utilized to generate non-redundant sequence sets with an identity threshold of 0.6.
The 6mA-negative samples from the above three species (i.e., fruit fly, worm, and human) were constructed by selecting non-6mA adenines randomly from the reference genomes (hg38, dm3, and ce10). To ensure negative sample quality, the non-6mA sites were not located in the ± 500bp flanking regions of positive 6mA sites. We also extracted ± 20 nt sequences for each negative site as the negative samples, and sequence identity was also less than 0.6. 880 6mA-positive and 880 6mA-negative samples of rice were obtained from i6mA-Pred (http://lin-group .cn/serve r/ i6mA-Pred) and retrieved by SMRT-seq [4]. All sequences were 41 nt long, with the 6mA site at the center position.
Finally, we constructed benchmark dataset from four species' data: i.e., rice, fruit fly, worm, and human. The final benchmark dataset contains 3040 positive samples and 3040 negative samples. Each DNA sequence in the study could be simplified as the formation: where represents the ith nucleotide in the sequence. Here, we used the following three sequence-based features: (1) electron-ion interaction pseudopotential (EIIP); (2) position-specific triple-nucleotide propensity (PSTNP), and (3) physicochemical properties. These feature extraction methods were implemented in our in-home R package RTFE (https ://githu b.com/ritia njian g/RTFE), the details of which are introduced in the following sections. Briefly, N i ∈ A(adenine), C cytosine , G guanine , T thymine we transformed each sample into a 172-dimensional numerical vector.

EIIP features
Electron-ion interaction pseudopotential, which reflects the electronic properties of nucleotides, was first used to predict the coding potential of genomic regions [39]. EIIP-based feature extraction methods were then widely applied in field prediction and classification, including the prediction of nucleosome positioning [40] and identification of E-gene signature [41].
The EIIP feature vector was constructed as follows: where EIIP xyz denotes the average EIIP value of three nucleotides (x, y, and z), f xyz denotes the frequency of the 3-tuple nucleotides xyz in the sample sequence and x, y, z ∈ (A, C, G, T). The EIIP values for the four nucleotides are: Using this method, we generated 64 features.

PSTNP features
Position-specific triple-nucleotide propensity describes the differences in nucleotide composition at each position between the sequences with and without 6mA modification. As a statistics-based feature extraction method, PSTNP has been used to address multiple molecular biological problems, including DNA N 4 -methylcytosine (4mC) site prediction [22], enhancer prediction [42], and σ70 promoter predictor [43]. Two subtypes of PSTNP were used in this study, i.e., single-stranded and double-stranded (PSTNP SS and PSTNP DS , respectively). The PSTNP SS features are based on the single-stranded characteristics of DNA and contain 64 (4 3 ) trinucleotides: AAA, AAC, AAG, …, TTT. Thus, for a sequence with a length of l-bp, the detailed information of the trinucleotide positions can be expressed by a 64 × (l − 2) matrix Z: where the variable Z i,j = F + 3mer i |j − F − 3mer i |j i = 1, 2, . . . , 64; j = 1, 2, . . . l − 2 .
F + (3mer i |j) and F − (3mer i |j) denote the frequency of the ith trinucleotide (3mer i ) at the jth position in the positive and negative datasets, respectively. 3mer 1 is AAA, 3mer 2 is AAC,… 3mer 64 is TTT in Eq. 6.
The sample in Eq. 1 can be expressed as the PSTNP SS vector: where T is the transpose operator and φ v is defined as: PSTNP DS features characterize double-stranded positionspecified information according to complementary pairing. We deemed A and T as identical, the same to C and G. Each sample could be converted into a sequence containing A and C only. For example, the DNA sequence "TCG AGT GAC" could be converted into "ACC ACA CAC". There are only eight (2 3 ) trinucleotides: AAA, AAC,…, CCC. Thus, for a sequence whose length is l-bp, detailed information on trinucleotide positions can be expressed by an 8 × (l − 2) matrix Z′: where the variable F + (3mer i |j) and F − (3mer i |j) denote the frequency of the ith trinucleotide (3mer i ) at the jth position in the positive and negative datasets, respectively. 3mer 1 is AAA, 3mer 2 is AAC, …, 3mer 8 is CCC in Eq. 10.
The sample in Eq. 1 can be expressed as the PSTNP DS vector: where S′ is the converted sequence and T is the transpose operator. In this formula, φ′ v is defined as: Here, both PSTNP SS and PSTNP DS generated 39 features.

Physicochemical properties
The pseudo-amino acid composition (PseAAC) method has been successful used to address many computational proteomics problems [44][45][46][47] and hastened the application of the pseudo k-tuple nucleotide composition (PseKNC) method. In this study, we used a simplified Type-II PseKNC based on physicochemical properties, which can represent the long-range interaction between oligonucleotides. The physicochemical Type-II PseKNC feature was constructed as follows: where d i reflects the long-range sequence-order physicochemical effect of a DNA sequence whose length is L-bp and definition is: In Eq. 15, λ denotes the tiers or correlation ranks along a DNA sequence and should be set to a signless integer less than L − k. Λ is the number of physicochemical properties used in feature construction. J ψ i,i+m denotes the correlation of the ψth physicochemical property between the ith dinucleotide (N i N i+1 ) and (i + m)th dinucleotide (N i+m N i+m+1 ). J ψ i,i+m can be calculated by: where H ψ (N i N i+1 ) and H ψ (N i+m N i+m+1 ) are the values of the ψth physicochemical property for dinucleotides N i N i+1 and N i+m N i+m+1 , respectively. In this study, six double-stranded B-DNA physicochemical properties (e.g., rise, ring, shift, slide, tilt, and twist) from DiProGB (https ://dipro db.leibn iz-fli.de/ShowT able.php) were used.
Before substituting values into Eq. 16, the original property values were standardized by the formula: is the original ψth physicochemical property value for N i N i+1 and �•� brackets are the average of quantity therein over the 16 different combinations of A, C, G, and T for N i N i+1 . SD is the standard deviation of the corresponding 16 property values.

Feature selection
Maximum relevance maximum distance (MRMD) [28] was used to select the features. The software package of MRMD was obtained from http://lab.malab .cn/soft/MRMD/index .html.

Gradient Boosting decision trees
The Gradient Boosting algorithm constructs a strong ensemble learner using multiple weak learners, such as decision trees, and has been applied in a series of biologically supervised classification projects, including prediction of gamma-aminobutyric acid type-A receptors and hot spots at protein-protein interfaces [48,49]. The Extreme Gradient Boosting (XGBoost) algorithm proposed by Chen and Guestrin [29] is an efficient implementation of Gradient Boosting and has been used extensively by data scientists [50]. The R interface in xgboost v0.81.0.1 was used in this study.
Appropriate tuning of parameters can strengthen a predictor's discrimination ability. We performed parameter tuning by grid search, with three parameters thus optimized: i.e., maximum tree depth for base weak learners (max_depth, from 2 to 10, step by 1), learning rate (eta, from 0.1 to 0.9, step by 0.05), and gamma (gamma, from 0 to 0.2, step by 0.002). We herein used tenfold cross-validation to select the optimal parameters by accuracy.

Performance assessment
We used the jackknife test to evaluate the predictor's performance [51]. Four indices were adopted: i.e., sensitivity (Sn), specificity (Sp), accuracy (Acc), and Matthews Correlation Coefficient (MCC). The indices were defined as: where N + − is the number of positive samples incorrectly predicted to be negative, N + is the total number of positive samples, N − + is the number of negative samples incorrectly predicted to be positive, and N − is the total number of negative samples. The four metrics above are valid only for single-label systems.
Additional file 1: Tables S1, S2. Summary of performances of 4 predictors on benchmark dataset and A. thaliana dataset.