The original paper can be found here. All figures and tables are from the paper.
What is UniRep?
In this paper, the authors applied deep learning to unlabelled amino-acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily and biophysically grounded. They call this data-driven, sequence-based approach a unified representation (UniRep).
UniRep is a recurrent neural network (RNN), more specifically a multiplicative long short-term memory (mLSTM) RNN. It summarises arbitrary protein sequences into fixed-length vectors that approximate fundamental protein features. This fixed-length vector representation is obtained by globally averaging the intermediate mLSTM numerical summaries (the hidden states) over all positions in the sequence.
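To make the averaging step concrete, here is a minimal sketch in plain Python. It is not the UniRep implementation: the function name `average_hidden_states` and the toy 3-unit hidden states are assumptions made for illustration (the real model has far more units and produces its hidden states with the mLSTM). The point is only that sequences of different lengths end up as vectors of the same length.

```python
from typing import List, Sequence


def average_hidden_states(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue hidden states (length L, each of size D) into one size-D vector."""
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    n_positions = len(hidden_states)
    # zip(*...) walks the hidden units column by column, so each output
    # element is the mean of one hidden unit across all sequence positions.
    return [sum(unit_values) / n_positions for unit_values in zip(*hidden_states)]


# Toy hidden states (hypothetical values): one row per residue, 3 units per row.
short_protein = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
long_protein = [[0.0, 1.0, 1.0], [2.0, 1.0, 1.0], [4.0, 1.0, 1.0], [6.0, 1.0, 1.0]]

short_rep = average_hidden_states(short_protein)  # [2.0, 1.0, 1.0]
long_rep = average_hidden_states(long_protein)    # [3.0, 1.0, 1.0]
```

Both representations have three elements even though the inputs have two and four residues, which is what lets a downstream model take proteins of any length as input.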
How to use UniRep
The original UniRep code is published on GitHub. There is also a JAX version, which is easily customisable, with additional utility APIs that support protein engineering workflows.
Here, I am using the original UniRep code in a Docker container. The pipeline in the UniRep code is as follows:
Set up the model with trained weights. The final hidden layer has either 1900 or 64 units.
Import the protein sequences.
Check whether each sequence contains any invalid characters.
Transform each sequence character into an integer.
Pad each sequence in a batch to a fixed length.
Extract the final hidden layer tensor from the trained model.
Use this final hidden layer tensor as input to a top model, then either train only the top model weights or train all weights.
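The preprocessing steps in this pipeline (validation, integer encoding and padding) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the UniRep code: the alphabet, the integer mapping `AA_TO_INT`, the padding index `PAD_ID` and the helper names are all hypothetical, and the real repository ships its own lookup table and helpers, which may differ.

```python
from typing import Dict, List

# Hypothetical alphabet: the 20 standard amino acids. The real UniRep code
# uses its own lookup table, which may differ in content and ordering.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_ID = 0  # assumed padding index; residues are numbered from 1
AA_TO_INT: Dict[str, int] = {aa: i for i, aa in enumerate(AMINO_ACIDS, start=1)}


def is_valid(seq: str) -> bool:
    """Return True if the sequence is non-empty and uses only known characters."""
    return len(seq) > 0 and all(ch in AA_TO_INT for ch in seq)


def encode(seq: str) -> List[int]:
    """Map each amino-acid character to its integer index."""
    if not is_valid(seq):
        raise ValueError(f"invalid sequence: {seq!r}")
    return [AA_TO_INT[ch] for ch in seq]


def pad_batch(encoded: List[List[int]]) -> List[List[int]]:
    """Right-pad every encoded sequence to the length of the longest one."""
    if not encoded:
        return []
    max_len = max(len(seq) for seq in encoded)
    return [seq + [PAD_ID] * (max_len - len(seq)) for seq in encoded]


sequences = ["MKV", "ACDEF", "MK?V"]  # the last one contains an invalid character
valid = [s for s in sequences if is_valid(s)]
batch = pad_batch([encode(s) for s in valid])
# batch == [[11, 9, 18, 0, 0], [1, 2, 3, 4, 5]]
```

The padded batch is rectangular, so it can be fed to the network as a single tensor. In a real run the padded positions would need to be masked or ignored when the hidden states are extracted, so that padding does not contribute to the representation.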
Comments