The original paper can be found here. All figures and tables are from the paper.
Sequence-based prediction and analysis are the basic tasks to understand the biological functions of DNA, RNA, proteins and peptides.
Some data analysis methods include Dynamic Time Warping (DTW) and the Hidden Markov Model (HMM).
Protein homolog search: DTW + HMM
Protein homolog search: DTW + HMM-HMM
Subcellular localisation classification
DNA sequence classification
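To make the DTW idea mentioned above more concrete, here is a minimal sketch of the classic dynamic-programming DTW distance between two numeric sequences. It is not the method of any specific tool from the paper; the function name `dtw_distance` and the use of absolute difference as the local cost are assumptions made for illustration. For biological sequences, the inputs would first have to be mapped to numeric profiles (for example a per-residue property scale), which is not shown here.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance (illustrative sketch)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost (assumed: absolute difference)
            # extend the cheapest of: insertion, deletion, or match
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


if __name__ == "__main__":
    # The second sequence is a "stretched" copy of the first, so DTW can align them at zero cost.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because DTW allows one element to align with several elements of the other sequence, it can compare sequences of different lengths, which is why it is attractive for sequence data.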
Some widely used traditional ML algorithms in this field include Random Forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB), Logistic Regression (LR), Decision Tree (DT), Light Gradient Boosting Machine (LGBM) and eXtreme Gradient Boosting (XGBoost). Deep Learning (DL) is widely used as well.
Database from the review paper can be found here.
Machine Learning pipeline
The Machine Learning pipeline for biological sequence classification follows a workflow of (a) dataset construction, (b) sequence representation and feature selection, (c) model training and evaluation, and (d) software implementation. The last part is not covered in this summary.
There are many online dataset libraries for biological data. Most amino acid sequence data can be obtained from UniProt. The CD-HIT software can be used to remove homologous sequences with a sequence identity threshold between 30% and 90%. The lower the threshold, the lower the sequence similarity among the sequences that remain in the dataset.
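The effect of the threshold can be illustrated with a small greedy filter. This is only a sketch of the general idea of redundancy removal, not CD-HIT itself: the similarity score below uses Python's `difflib.SequenceMatcher` as a stand-in for a real sequence identity computation, and the names `similarity` and `remove_redundant` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in similarity in [0, 1]; an assumption, CD-HIT computes identity differently.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_redundant(seqs, threshold=0.9):
    """Greedy filter: keep a sequence only if it is below `threshold`
    similarity to every sequence that has already been kept."""
    kept = []
    for s in sorted(seqs, key=len, reverse=True):  # process longest sequences first
        if all(similarity(s, r) < threshold for r in kept):
            kept.append(s)
    return kept


if __name__ == "__main__":
    toy = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGSSSSPP"]  # made-up sequences
    print(remove_redundant(toy, threshold=0.8))   # stricter: near-duplicate is dropped
    print(remove_redundant(toy, threshold=0.95))  # looser: all three are kept
```

A lower threshold removes more sequences, so the remaining dataset is less redundant but also smaller.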
Sequence representation and feature selection
Several software platforms have been developed to generate features from a dataset, including iLearnPlus, PyFeat, iFeature, BioSeq-Analysis 2.0, VisFeature and POSSUM.
For DNA and RNA sequences, there are 7 major encoding schemes, including:
nucleic acid composition
position-specific tendencies of trinucleotides
electron-ion interaction pseudopotentials
autocorrelation and cross-covariance
pseudo-nucleic acid composition
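As an illustration of the simplest scheme in this list, nucleic acid composition, the sketch below computes normalised k-mer frequencies for a DNA sequence. The function name `kmer_composition`, the default `k=2`, and the choice to skip windows containing ambiguous bases are assumptions for this example; the platforms listed above may define the descriptor differently.

```python
from itertools import product


def kmer_composition(seq, k=2, alphabet="ACGT"):
    """Return normalised k-mer frequencies in a fixed lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows with ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    # Every sequence maps to a vector of length len(alphabet) ** k
    return [counts[km] / total if total else 0.0 for km in kmers]


if __name__ == "__main__":
    print(kmer_composition("ACGT", k=1))
    print(len(kmer_composition("ACGTACGT", k=2)))
```

The useful property is that sequences of any length are mapped to a fixed-length vector, which is what most traditional classifiers expect as input.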
For proteins and peptides, the feature descriptors have 8 categories:
There is also an automatic encoding method based on deep learning, proposed by Wei, which has been shown to be more informative and effective than the traditional methods. Other successful methods build on pre-trained deep learning encoders, such as unified representation (UniRep), Tasks Assessing Protein Embeddings (TAPE), MULocDeep and the BiLSTM embedding model.
After sequence representation, we also need to do feature selection to reduce dimensionality.
The most commonly used feature selection methods include:
analysis of variance
maximum relevance and minimum redundancy
Pearson correlation coefficient
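A filter-style selection based on the last item, the Pearson correlation coefficient, can be sketched as follows. This is a minimal illustration and not the procedure of any specific tool: the helper names `pearson` and `rank_features` are hypothetical, and ranking features by the absolute correlation with a numeric class label is an assumption about how the criterion is applied.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0  # constant columns score 0


def rank_features(X, y, top_k):
    """Return the indices of the top_k feature columns by |r| with the label."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:top_k]


if __name__ == "__main__":
    X = [[1, 5, 0], [2, 5, 1], [3, 5, 0], [4, 5, 1]]  # toy feature matrix
    y = [0, 0, 1, 1]                                  # toy binary labels
    print(rank_features(X, y, top_k=1))
```

In this toy data only the first column varies with the label, so it is the one retained.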
Common feature extraction methods are shown here:
Model training and evaluation
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are frequently used in prediction research. The CNN-based classifier Gene2vec was used for prediction of N6-methyladenosine (m6A) modification sites in mammalian messenger RNA sequences. The bidirectional gated RNN-based classifier BERMP was used to predict m6A modification sites in different species as well.
The commonly used evaluation methods include k-fold cross-validation (k-fold CV), leave-one-out CV (LOOCV) and independent tests. The commonly used metrics include accuracy (Acc), specificity (Sp), sensitivity (Sn), Matthews correlation coefficient (MCC), ROC and balanced accuracy (BACC), which is employed to measure the overall performance of a model trained on imbalanced datasets.
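The threshold-based metrics above all follow from the four entries of a binary confusion matrix. The sketch below uses their standard definitions; the function name `binary_metrics` and the convention of returning 0 when a denominator is zero are choices made for this example.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute Acc, Sn, Sp, BACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn) if tp + fn else 0.0   # sensitivity (recall on positives)
    sp = tn / (tn + fp) if tn + fp else 0.0   # specificity (recall on negatives)
    acc = (tp + tn) / (tp + tn + fp + fn)
    bacc = (sn + sp) / 2                      # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "BACC": bacc, "MCC": mcc}


if __name__ == "__main__":
    # A classifier that predicts "negative" for everything on a 10:90 dataset
    print(binary_metrics(tp=0, tn=90, fp=0, fn=10))
```

In this example the accuracy is 0.9 even though no positive sample is detected, while BACC is 0.5 and MCC is 0, which is why the latter two are preferred on imbalanced datasets.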
Application in protein sequences
Proteins are composed of 20 different types of amino acids. The classification tasks include conventional unequal-length and equal-length sequences.
The conventional unequal-length sequence types include:
RNA-binding proteins (RBPs)
phage virion protein
cell wall lytic enzymes
major histocompatibility complex
biological luminescent proteins
electron transport proteins
plant pentatricopeptide repeat
type III secretion systems
The classification of protein equal-length sequences is related to the prediction of protein PTM sites.
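One common way such equal-length samples arise is by cutting a fixed-size window around each candidate residue. The sketch below illustrates this under stated assumptions: the function name `site_windows`, the candidate residues `"STY"` (typical for phosphorylation), the flank size, and the padding character `"X"` are all choices made for this example and are not taken from the paper.

```python
def site_windows(seq, residues="STY", flank=3, pad="X"):
    """Return (position, window) pairs of length 2 * flank + 1,
    each centred on a candidate residue."""
    # Pad both ends so that sites near the termini still give full-length windows
    padded = pad * flank + seq + pad * flank
    windows = []
    for i, aa in enumerate(seq):
        if aa in residues:
            windows.append((i, padded[i:i + 2 * flank + 1]))
    return windows


if __name__ == "__main__":
    for pos, win in site_windows("MKSAY", flank=2):  # made-up peptide
        print(pos, win)
```

Every window has the same length regardless of the protein it came from, so each candidate site becomes one fixed-length sample for a classifier.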
For DNA-binding proteins, TargetDBP plus is a classification tool that uses weighted convolution features to encode sequence information.
For RBPs, three methods have been applied in classification studies: rBPDL (balanced), RBPro-RF and TriPepSVM. RBP datasets consist of positive and negative samples, so this is a binary classification problem on imbalanced data. SMOTE was used to handle the imbalance.
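The core idea of SMOTE is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The following is a simplified sketch of that idea, not the implementation used in the cited studies; the function name `smote`, the parameters `k` and `seed`, and the use of squared Euclidean distance are assumptions for illustration. It expects at least two minority samples.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples (simplified SMOTE sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        nb = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from x to the neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


if __name__ == "__main__":
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy feature vectors
    for sample in smote(minority, n_new=3):
        print(sample)
```

Oversampling of this kind should be applied to the training split only, so that synthetic samples do not leak into the evaluation data.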
For cell wall lytic enzyme classification, there are three tools: CWLy-RF, CWLy-SVM and Jing's method.
For PTM site classification on equal-length protein sequences, there are phosphorylation predictors based on DeepPPSite and DeepPSP, as well as DeepTL-Ubi, CNNAthUbi (based on CNN) and ImbClassi_PTMs.