The research paper can be found here.
In this paper, the authors introduce a protein sequence segmentation method called peptide-pair encoding (PPE): a general-purpose, probabilistic segmentation of protein sequences into commonly occurring variable-length sub-sequences. The method was inspired by the byte-pair encoding (BPE) text compression algorithm used in Natural Language Processing.
The PPE segments were used as input to two downstream machine learning applications: DiMotif and ProtVecX. The results showed that the new embedding method outperforms the others in some cases. Here, I focus on protein classification tasks with ProtVecX, a protein sequence embedding trained on variable-length segments of protein sequences.
This approach was presented as a general-purpose embedding method, whereas other segmentation approaches were tied to specific sequence tasks such as secondary structure prediction or binding site prediction.
Peptide-pair encoding
The input to the PPE algorithm is a set of sequences, and the output is the segmented sequences together with the segmentation operations: an ordered list of amino-acid merging operations that can be applied to segment new sequences. At the start of the algorithm, each sequence is a list of individual amino acids. The algorithm then searches for the most frequently occurring pair of adjacent amino acids across all input sequences. Next, every occurrence of the selected pair is replaced by its merged version as a new symbol (a short peptide). This process continues until no frequent pattern can be found or a certain vocabulary size is reached.
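To make the merging loop concrete, here is a minimal Python sketch of this BPE-style procedure. It is not the authors' implementation; the function names (most_frequent_pair, ppe_train), the vocab_size and min_freq stopping criteria, and the toy sequences are illustrative assumptions.

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent symbol pairs across all (already segmented) sequences."""
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts.most_common(1)[0] if counts else (None, 0)

def ppe_train(sequences, vocab_size=50, min_freq=2):
    """Learn an ordered list of merge operations from raw protein sequences.

    `sequences` is a list of amino-acid strings; `vocab_size` and `min_freq`
    are illustrative stopping criteria (target symbol count and the minimum
    frequency a pair must reach to be merged).
    """
    # Start with each sequence as a list of single amino acids.
    segmented = [list(seq) for seq in sequences]
    merges = []
    symbols = {aa for seq in segmented for aa in seq}

    while len(symbols) < vocab_size:
        pair, freq = most_frequent_pair(segmented)
        if pair is None or freq < min_freq:
            break  # no sufficiently frequent pattern left
        merged = pair[0] + pair[1]
        merges.append(pair)
        symbols.add(merged)
        # Replace every occurrence of the pair with the merged symbol.
        for i, seq in enumerate(segmented):
            j, out = 0, []
            while j < len(seq):
                if j < len(seq) - 1 and (seq[j], seq[j + 1]) == pair:
                    out.append(merged)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            segmented[i] = out
    return merges, segmented

# Tiny toy example (not real proteins):
merges, segs = ppe_train(["MKTAYIAK", "MKTAAKMK", "MKAYIAKT"], vocab_size=25)
print(merges[:3])
print(segs)
```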
The segmentation is trained on the high-quality Swiss-Prot database, which contains over 500k protein sequences. Although a single segmentation scheme is helpful in some cases, it can also discard information that matters for the task of interest. Hence, the PPE method proposes a sampling framework for estimating the segmentation of a sequence in a probabilistic manner.
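One simple way to picture this probabilistic segmentation is to apply a randomly chosen number of the learned merge operations each time, producing several alternative segmentations of the same sequence. The sketch below illustrates that idea; it is my own simplification, not the paper's exact sampling scheme, and the hard-coded `merges` list is a hypothetical placeholder (in practice it would come from training, as in the previous sketch).

```python
import random

# Hypothetical merge list; in practice this would be learned from data.
merges = [("A", "K"), ("M", "K"), ("AK", "T")]

def apply_merges(sequence, merges):
    """Apply an ordered list of (left, right) merge operations to one sequence."""
    seq = list(sequence)
    for left, right in merges:
        merged = left + right
        j, out = 0, []
        while j < len(seq):
            if j < len(seq) - 1 and seq[j] == left and seq[j + 1] == right:
                out.append(merged)
                j += 2
            else:
                out.append(seq[j])
                j += 1
        seq = out
    return seq

def sample_segmentations(sequence, merges, n_samples=3, seed=0):
    """Draw alternative segmentations by applying a random number of merges."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        k = rng.randint(1, len(merges))  # illustrative: uniform over merge counts
        samples.append(apply_merges(sequence, merges[:k]))
    return samples

for seg in sample_segmentations("MKAKTAK", merges):
    print(seg)
```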
Results
Protein classification results for venom toxin, subcellular location, and enzyme prediction, using a deep MLP neural network on top of different combinations of features, are shown below.
There does not appear to be much difference in the results across the different representations.
Comments