Original paper can be found here.
Two main kinds of protein features are used for the classification. One is sequence information.
Proteins are built from 20 different types of amino acids. These amino acids form a molecular chain that varies in order, direction, and length, with the total number of amino acids ranging from 50 to over 1500. Proteins with fewer than 1500 amino acids account for 98% of the proteins found. Hence the majority of proteins can be represented by a 1500 by 20 matrix, where 1500 is the maximum number of amino acids and 20 is the size of the one-hot encoding of each amino acid.
Many feature extraction methods are based on sequence information and have been used for classification:
The pseudo-amino acid composition (PseAAC)
Local amino acid composition (LAAC)
Local dipeptide composition (LDC)
Global descriptor (GD)
Lempel-Ziv complexity (LZC)
Autocorrelation descriptor (AD)
Sequence-order descriptor (SD)
Hilbert-Huang transform (HHT)
Peptide composition method
Dipeptide composition (DipC)
Tripeptide composition (TipC)
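As an illustration of one of the simpler descriptors listed above, dipeptide composition can be sketched as the relative frequency of each of the 400 possible amino acid pairs in a sequence. This is a minimal sketch under that standard definition, not the implementation used in the paper; the function name `dipeptide_composition` and the choice to skip pairs containing non-standard symbols are assumptions made for the example.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 * 20 = 400 ordered pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional vector of dipeptide frequencies."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(sequence) - 1):
        pair = sequence[i:i + 2]
        # Assumption: pairs with non-standard symbols (e.g. 'X') are skipped.
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Hypothetical toy sequence: the pairs are MK, KV, VM, MK, KV.
features = dipeptide_composition("MKVMKV")
```

The resulting vector has a fixed length regardless of the protein length, which is what makes composition-style descriptors convenient as classifier inputs.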
The other is evolutionary information, which mainly refers to the position-specific scoring matrix (PSSM). It forms a 1500 by 20 matrix, where the 20 columns hold the scores of the amino acid at each position against all 20 amino acids.
When a protein has fewer than 1500 amino acids, zeros are added as padding for both features.
The paper proposes two deep learning models that are trained separately on the sequence features and the evolutionary features for protein type classification. The last layer of each model is a softmax function, which outputs a probability for each class. These probabilities are then combined and fed to a meta-classifier, which is trained to produce the final output.
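The one-hot encoding and zero padding described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the alphabet order, the function name `one_hot_encode`, the truncation of sequences longer than 1500, and the treatment of unknown symbols as all-zero rows are assumptions, not details taken from the paper.

```python
# The 20 standard amino acids in one-letter code (the order is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MAX_LEN = 1500


def one_hot_encode(sequence, max_len=MAX_LEN):
    """Return a max_len x 20 matrix as a list of lists.

    Rows beyond the end of the sequence stay all-zero (padding).
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    # Assumption: sequences longer than max_len are truncated.
    for pos, aa in enumerate(sequence[:max_len]):
        col = AA_INDEX.get(aa)
        # Assumption: unknown symbols (e.g. 'X') are left as a zero row.
        if col is not None:
            matrix[pos][col] = 1
    return matrix


# Hypothetical three-residue sequence, padded to 1500 rows.
encoded = one_hot_encode("MKV")
```

A PSSM would be padded the same way, with rows of 20 zeros appended until the matrix has 1500 rows.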
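One plausible way to combine the two softmax outputs is to concatenate them into a single feature vector for the meta-classifier. The sketch below shows only that step. The logit values are made up, the three-class setup is hypothetical, and concatenation is an assumption, since these notes do not say how the probabilities are combined or which meta-classifier is used.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical last-layer scores for one protein and three classes.
seq_logits = [2.0, 0.5, -1.0]  # from the sequence feature model
evo_logits = [1.0, 1.5, 0.0]   # from the evolutionary feature model

seq_probs = softmax(seq_logits)
evo_probs = softmax(evo_logits)

# Assumption: the meta-classifier input is the concatenation of both
# probability vectors (2 * number_of_classes features per protein).
meta_features = seq_probs + evo_probs
```

Any standard classifier could then be trained on `meta_features` collected over the training set.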
Sequence feature model
One convolutional layer and one bi-LSTM, followed by a softmax function.
Evolutionary feature model
Two convolutional layers with average pooling, followed by PrimaryCaps.
The paper claims better results, especially when combined with another feature extraction method based on an auto-encoder.
What is PrimaryCaps
It functions very similarly to a convolutional neural network (CNN). In a CNN, the location information of each "pixel" is gradually lost through convolution and, sometimes, pooling. PrimaryCaps captures this information by representing each node with a vector instead of a scalar value. For example, where one node in a CNN holds a single scalar value, a capsule groups 8 nodes to represent an 8-dimensional vector. This is called the PrimaryCaps. The weighted sum of the PrimaryCaps is then used to decide the protein type in this case. If the information at one location contributes heavily to deciding the protein type, the weight of that capsule gradually increases during training. This works much like the attention mechanism in a sequence-to-sequence model.
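The grouping of scalar nodes into 8-dimensional capsules can be sketched as follows. The `squash` non-linearity comes from the original capsule network formulation (Sabour et al., 2017), and whether this paper uses exactly that form is an assumption. The function names and the flat list of activations are hypothetical stand-ins for the output of the convolutional layers.

```python
import math


def squash(vector, eps=1e-9):
    """Shrink a vector so its length lies in [0, 1), keeping its direction."""
    sq_norm = sum(x * x for x in vector)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in vector]


def to_primary_caps(activations, capsule_dim=8):
    """Group scalar activations into capsules of capsule_dim values each."""
    if len(activations) % capsule_dim != 0:
        raise ValueError("number of activations must be a multiple of capsule_dim")
    return [
        squash(activations[i:i + capsule_dim])
        for i in range(0, len(activations), capsule_dim)
    ]


# Hypothetical convolution output: 16 scalar nodes become 2 capsules.
activations = [0.1 * i for i in range(16)]
capsules = to_primary_caps(activations)
lengths = [math.sqrt(sum(x * x for x in c)) for c in capsules]
```

In this formulation the length of each capsule vector can be read as how strongly the feature is present, while its direction encodes the properties of that feature.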