
A summary of the review paper on Biological Sequence Classification

Updated: Dec 12, 2023

The original paper can be found here. All figures and tables are from the paper.


Sequence-based prediction and analysis are basic tasks for understanding the biological functions of DNA, RNA, proteins and peptides.

Some data analysis methods include Dynamic Time Warping (DTW) and the Hidden Markov Model (HMM).
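To give a feel for what dynamic time warping (DTW) does, here is a minimal sketch of the classic dynamic-programming DTW distance between two numeric sequences. The function name `dtw_distance` and the absolute-difference local cost are my own choices for illustration, and a biological sequence would first need to be turned into numbers (for example, through a physicochemical property scale), which is not shown here.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences (illustrative sketch)."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local cost: absolute difference
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# A stretched copy of the same signal aligns with zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The point of the warping is that sequences of different lengths can still be compared, because one element is allowed to match several elements of the other sequence.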



Protein homolog search




Subcellular localisation classification


DNA sequence classification


Widely used ML algorithms in this field include Random Forest (RF), Support Vector Machine (SVM), Naïve Bayes (NB), Logistic Regression (LR), Decision Tree (DT), Light Gradient Boosting Machine (LGBM) and extreme Gradient Boosting (XGBoost), alongside Deep Learning (DL).

The databases from the review paper can be found here.

Machine Learning pipeline

The Machine Learning pipeline for biological sequence classification consists of (a) dataset construction, (b) sequence representation and feature selection, (c) model training and evaluation, and (d) software implementation. The last part is not covered in this summary.

Dataset construction

There are many online repositories for biological data. Most amino acid sequence data can be obtained from UniProt. The CD-HIT software can be used to remove homologous sequences with an identity threshold between 30% and 90%. The lower the threshold, the less similar the remaining sequences in the dataset are to one another.
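To illustrate how the threshold affects the dataset, here is a toy greedy redundancy filter. This is not CD-HIT's algorithm: the similarity measure (Python's `difflib.SequenceMatcher` ratio) and the function names `similarity` and `greedy_filter` are stand-ins I picked so the sketch runs with the standard library only.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Rough similarity in [0, 1]; a stand-in for a real sequence identity score."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_filter(seqs, threshold):
    """Keep a sequence only if it is below `threshold` similarity to every kept one."""
    representatives = []
    # Longest sequences are considered first, so they become the representatives.
    for seq in sorted(seqs, key=len, reverse=True):
        if all(similarity(seq, rep) < threshold for rep in representatives):
            representatives.append(seq)
    return representatives


seqs = ["MKTAYIAKQR", "MKTAYIAKQK", "GGGGGGGGGG"]
print(len(greedy_filter(seqs, 0.95)))  # looser filter keeps more sequences
print(len(greedy_filter(seqs, 0.80)))  # stricter filter removes the near-duplicate
```

With the lower threshold, the two nearly identical sequences collapse into one representative, which is the behaviour described above.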

Sequence representation and feature selection

Several software platforms have been developed to generate features from a dataset, including iLearnPlus, PyFeat, iFeature, BioSeq-Analysis 2.0, VisFeature and POSSUM.

For DNA and RNA sequences, there are 7 major encoding schemes:

nucleic acid composition

residue composition

position-specific tendencies of trinucleotides

electron-ion interaction pseudopotentials

autocorrelation and cross-covariance

physicochemical property

pseudo-nucleic acid composition
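As a small example of the first scheme, nucleic acid composition can be sketched as normalised k-mer frequencies. The function name `kmer_composition` and its parameters are hypothetical, and the platforms listed earlier may define or normalise this descriptor differently.

```python
from itertools import product


def kmer_composition(seq, k=1, alphabet="ACGT"):
    """Return the frequency of every possible k-mer, in a fixed alphabetical order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with ambiguous bases such as N are skipped
            counts[kmer] += 1
            total += 1
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


print(kmer_composition("ACGT"))              # mononucleotide composition
print(len(kmer_composition("ACGTAC", k=2)))  # dinucleotide vector has 4**2 entries
```

Whatever the sequence length, the output vector has a fixed size of 4^k, which is what makes it usable as input to the ML algorithms mentioned above.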

For proteins and peptides, the feature descriptors have 8 categories:

There is also an automatic encoding method using deep learning, proposed by Wei, which has been shown to be more informative and effective than the traditional methods. Other successful methods are based on pre-trained deep learning encoders, such as unified representation (UniRep), Tasks Assessing Protein Embeddings (TAPE), MULocDeep and the BiLSTM embedding model.

After sequence representation, we also need to do feature selection to reduce dimensionality.

The most commonly used feature selection methods include:

analysis of variance


maximum relevance and minimum redundancy

Pearson correlation coefficient
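As one illustration of this kind of selection, features can be ranked by the absolute Pearson correlation between each feature column and the class label, keeping only the top few. The function names `pearson` and `rank_features` and the toy data are made up for this sketch. It is a simple filter and does not account for redundancy between features.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def rank_features(X, y, top_k):
    """Return indices of the top_k features by |correlation| with the label."""
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        scores.append((abs(pearson(column, y)), j))
    scores.sort(key=lambda item: (-item[0], item[1]))  # ties broken by feature index
    return [j for _, j in scores[:top_k]]


# Toy data: feature 0 tracks the label, feature 1 is constant, feature 2 is unrelated.
X = [[0.1, 5.0, 1.0], [0.2, 5.0, 0.0], [0.8, 5.0, 1.0], [0.9, 5.0, 0.0]]
y = [0, 0, 1, 1]
print(rank_features(X, y, top_k=1))
```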



Common feature extraction methods are shown here:

Model training and evaluation

Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are frequently used for prediction research. The CNN-based classifier Gene2vec was used for prediction of N6-methyladenosine (m6A) modification sites in mammalian messenger RNA sequences. The bidirectional gated RNN-based classifier BERMP was used to predict m6A modification sites in different species as well.

The commonly used evaluation methods include k-fold cross-validation (k-fold CV), leave-one-out CV (LOOCV), and independent tests. The commonly used metrics include accuracy (Acc), specificity (Sp), sensitivity (Sn), Matthews correlation coefficient (MCC), the receiver operating characteristic (ROC) curve and balanced accuracy (BACC), which is used to measure the overall performance of a model trained on imbalanced datasets.
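The threshold-based metrics can be computed directly from the four cells of a binary confusion matrix. Below is a minimal sketch using the standard textbook definitions. The function name `binary_metrics` is my own, and ROC is left out because it needs prediction scores rather than counts.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute Acc, Sn, Sp, BACC and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity (recall on positives)
    sp = tn / (tn + fp) if (tn + fp) else 0.0  # specificity (recall on negatives)
    acc = (tp + tn) / (tp + tn + fp + fn)
    bacc = (sn + sp) / 2                       # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "BACC": bacc, "MCC": mcc}


# Imbalanced toy case: 10 positives, 90 negatives, half of the positives missed.
for name, value in binary_metrics(tp=5, tn=90, fp=0, fn=5).items():
    print(f"{name}: {value:.3f}")
```

In this toy case the accuracy looks high because negatives dominate, while BACC and MCC come out lower and expose the missed positives. This is why they are preferred for imbalanced datasets.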

Application in protein sequences

Proteins are composed of 20 different types of amino acids. The classification tasks include conventional unequal-length sequences and equal-length sequences.

The conventional unequal-length sequence types include:

DNA-binding proteins

RNA-binding proteins (RBPs)

secretory proteins


phage virion protein

cell wall lytic enzymes

thermophilic proteins

major histocompatibility complex

antioxidant proteins

biological luminescent proteins

electron transport proteins

plant pentatricopeptide repeat

sub-Golgi protein

type III fluctuation systems

The classification of protein equal-length sequences is related to the prediction of protein PTM sites.
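The sequences are of equal length because PTM site predictors typically work on fixed-size windows centred on each candidate residue. Here is a minimal sketch of that window extraction. The function name `extract_windows`, the flank size and the `X` padding character are assumptions for illustration, and individual tools choose their own window sizes and padding.

```python
def extract_windows(seq, target="K", flank=3, pad="X"):
    """Return (1-based position, window) for every `target` residue in `seq`."""
    # Pad both ends so residues near the termini still give full-length windows.
    padded = pad * flank + seq + pad * flank
    windows = []
    for i, residue in enumerate(seq):
        if residue == target:
            # seq[i] sits at padded[i + flank], so this slice is centred on it.
            windows.append((i + 1, padded[i:i + 2 * flank + 1]))
    return windows


# Candidate lysine sites in a toy peptide, with 2 residues on each side.
print(extract_windows("MKTAYK", target="K", flank=2))
```

Every window has length 2 × flank + 1, so the downstream classifier always receives inputs of the same size.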

For DNA-binding proteins, TargetDBP plus is a classification tool that uses weighted convolution features to encode sequence information.

For RBPs, three methods have been applied in classification studies: rBPDL (balanced), RBPro-RF, and TriPepSVM. RBP datasets contain positive and negative samples, so this is a binary classification problem with imbalanced data. SMOTE was used to handle the imbalance.
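SMOTE (Synthetic Minority Over-sampling Technique) balances a dataset by creating new minority-class samples on the line between a real minority sample and one of its nearest minority neighbours. The sketch below is a simplified standard-library version for illustration. The function name `smote_sketch` and its parameters are hypothetical, it is not the implementation used by the tools above, and it needs at least two minority samples.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen sample (Euclidean distance)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
for point in smote_sketch(minority, n_new=3):
    print(point)
```

Because each new point lies between two real minority samples, the synthetic data stays inside the region already occupied by the minority class and does not simply duplicate existing rows.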

For cell wall lytic enzyme classification, there are three tools: CWLy-RF, CWLy-SVM and Jing's method.

For PTM site classification of equal-length protein sequences, there are phosphorylation predictors based on DeepPPSite and DeepPSP, as well as DeepTL-Ubi, the CNN-based CNNAthUbi, and ImbClassi_PTMs.
