Pfam is a database of protein domain families. Each entry in the database is a domain, that is, a segment of a protein. These domains are classified into families based on the characteristics of the protein segments.
The Pfam database has two parts: Pfam-A and Pfam-B. Pfam-A is curated and contains well-characterised protein domain families with high-quality alignments. These families are maintained using manually checked seed alignments and hidden Markov models (HMMs), which are used to find and align all members of each family. Pfam-B contains sequence families that were generated automatically by applying the Domainer algorithm to cluster and align the protein sequences that remain after the Pfam-A domains are removed.
What is seeded alignment?
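When we align two sequences, short regions that match well between them are located first, and the stretches between those regions are then filled in, with gaps where needed. These short matching regions are commonly called seeds. Seeds have been used not only for large-scale local alignment but also as anchor points in whole-genome and multiple sequence alignment algorithms. In Pfam specifically, the seed alignment of a family is a small, manually checked alignment of representative members, from which the family's HMM is built.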
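To make the idea of seeds concrete, here is a minimal sketch that finds exact k-mer matches shared by two sequences. It is an illustration only: the function name `find_seeds`, the choice of exact matching, the value `k=3`, and the two example sequences are all assumptions, and real tools use more elaborate seeding and scoring.

```python
from collections import defaultdict


def find_seeds(seq_a, seq_b, k=3):
    """Return sorted (pos_in_a, pos_in_b, kmer) for every exact k-mer shared by both sequences."""
    # Index every k-mer of seq_a by its start position.
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(i)

    # Look up every k-mer of seq_b in that index.
    seeds = []
    for j in range(len(seq_b) - k + 1):
        kmer = seq_b[j:j + k]
        for i in index.get(kmer, []):
            seeds.append((i, j, kmer))
    return sorted(seeds)


# Hypothetical example sequences, not taken from Pfam.
seeds = find_seeds("MKTAYIAKQR", "GGTAYIAWQR", k=3)
print(seeds)  # [(2, 2, 'TAY'), (3, 3, 'AYI'), (4, 4, 'YIA')]
```

The three overlapping seeds mark the shared stretch `TAYIA`. An alignment algorithm would use such seeds as anchor points and then extend or chain them.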
What does the Domainer algorithm do?
The Domainer algorithm clusters domain families based on all-versus-all BLASTP matching. It is a fully automatic approach that was used to build the ProDom database. The clustering level of Domainer depends on the score threshold above which pairwise BLASTP matches are accepted. The domain borders are inferred by analysing the extent of the BLAST matches and from the NH2- and COOH-terminal ends. The main problems with this method are that it does not scale well and that it is sensitive to incorrect data.
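The dependence of the clustering level on the accepted score can be illustrated with a toy example. The sketch below is not the Domainer implementation: it assumes simple single-linkage clustering over a dictionary of made-up pairwise scores, and the names `cluster_by_score` and `min_score` are hypothetical. It also ignores domain border inference entirely.

```python
def cluster_by_score(names, pairwise_scores, min_score):
    """Group sequences linked by any pairwise match scoring at least min_score."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every accepted pair.
    for (a, b), score in pairwise_scores.items():
        if score >= min_score:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Made-up scores standing in for all-versus-all BLASTP results.
names = ["s1", "s2", "s3", "s4", "s5"]
scores = {("s1", "s2"): 120, ("s2", "s3"): 95, ("s3", "s4"): 40, ("s4", "s5"): 150}

loose = cluster_by_score(names, scores, min_score=50)
strict = cluster_by_score(names, scores, min_score=100)
print(loose)   # [['s1', 's2', 's3'], ['s4', 's5']]
print(strict)  # [['s1', 's2'], ['s3'], ['s4', 's5']]
```

Raising the threshold splits `s3` off into its own cluster. The example also hints at the sensitivity to incorrect data: under single linkage, one spurious high-scoring match is enough to merge two otherwise unrelated clusters.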
Pfam-A
In the Pfam-A database, we have the aligned protein sequence data as well as the original sequences. The figure below shows the pattern of seeds in one protein domain family. Each colour represents a numerically encoded unit (an amino acid), and the number 0 represents the filler gaps between seeds.
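A numerical encoding of this kind could look like the sketch below. The exact mapping used for the figure is not given here, so the alphabet order, the code 21 for unknown residues, and the treatment of `-` and `.` as gap characters are all assumptions. Only the use of 0 for gaps follows the description above.

```python
# Assumed alphabet order; any fixed order works as long as 0 is reserved for gaps.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
CODE = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}
UNKNOWN = len(AMINO_ACIDS) + 1  # assumed code for any non-standard residue


def encode_aligned(seq, gap_chars="-."):
    """Map an aligned sequence to integers: 0 for gaps, 1..20 for amino acids."""
    return [0 if ch in gap_chars else CODE.get(ch.upper(), UNKNOWN) for ch in seq]


# Hypothetical aligned fragment, not taken from Pfam.
encoded = encode_aligned("AC--D.")
print(encoded)  # [1, 2, 0, 0, 3, 0]
```

Encoding every aligned sequence of a family this way gives an integer matrix with one row per sequence, which can then be drawn as a colour map.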