Evolutionary Classification of Protein Domains: From Remote Homology to Family






Understanding the evolution of a protein, including both its close and distant relationships, often yields insight into its structure and function. A protein domain classification splits proteins into domains and organizes them according to their evolutionary history. Existing classification databases lag behind the pace of protein structure determination and do not include some known homologous relationships. I have participated in creating a hierarchical evolutionary classification of all proteins with experimentally determined spatial structures and developed a website for easy access and for searches by keyword, sequence, or structure (http://prodata.swmed.edu/ecod). ECOD (Evolutionary Classification Of protein Domains) is distinct from other structural classifications in that it groups domains primarily by evolutionary relationship (homology) rather than by topology (fold). Our database uniquely emphasizes distantly related homologs that are difficult to detect, and thus catalogs the largest number of evolutionary relationships among structural domain classifications. Placing distant homologs together underscores the ancestral similarities of these proteins and draws attention to the most important regions of sequence and structure, as well as to conserved functional sites. The classification is assisted by an automated pipeline that classifies most new structures in the Protein Data Bank each week. This weekly synchronization sets ECOD apart from all other protein classifications. For proteins that lack confident results from the automatic pipeline, I rely on information from the literature, sequence and structure similarity scores, visual comparison, and experience to classify them manually. I document the manual curation process in detail, using as an example the remote homology among the autoproteolytic domains found in the GPCR-Autoproteolysis Inducing domain, ZU5, and nucleoporin 98. ECOD also recognizes closer relationships at the family level, initially defined with Pfam families.
However, existing family databases do not cover all structures and disagree with ECOD on domain definitions and boundaries. I generate a multiple sequence alignment and a profile for the domains of each family using structural information, and demonstrate that the alignment quality is similar to that of manually checked Pfam seed alignments. I compare ECOD family profiles with those of Pfam and the Conserved Domain Database, and discuss the improvement in domain boundaries over known families and the predominance of small families among the new families.
