Eisenhaber F.1,2, Huynen M.1,2, Orengo Ch.3, Sunyaev Sh.4, Yuan Ya.1,2, Bork P.1,2
1EMBL Heidelberg, Meyerhofstr. 1, D-69012 Heidelberg, Germany
2Max-Delbrueck-Centrum fuer Molekulare Medizin, Robert-Roessle-Str. 10, D-13122 Berlin-Buch, Germany
3University College, Dept. Biochemistry & Molecular Biology, London, U.K.
4Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, 117984, Moscow, Vavilov Street 32, Russia
As a result of large-scale genomic sequencing, the gene sequences of many proteins are known, but their structures and functions are poorly understood. The major source of hypotheses about such uncharacterized proteins is sequence comparison and the extrapolation of annotated information from homologous proteins.
Iterative homology searches have been used to assign folds to the protein sequences derived from the complete genomes of M. genitalium, E. coli, and M. jannaschii. Protein sequence segments predicted to represent coiled-coil and transmembrane regions were excluded from the analysis. The procedure resulted in a fold assignment for at least one domain in 30-40% of all proteins of M. genitalium, E. coli, and M. jannaschii. The accuracy of this prediction appears to be 98%, as estimated from iterative homology searches for the 685 sequences of a maximal subset of non-homologous proteins extracted from the PDB.
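The exclusion of coiled-coil and transmembrane segments can be illustrated with a minimal sketch. The abstract does not say how the segments were removed; replacing them with the placeholder residue "X" before a search is an assumption made here for illustration, and the function name `mask_segments` and the interval convention (0-based, half-open) are hypothetical.

```python
from typing import Iterable, Tuple


def mask_segments(sequence: str,
                  segments: Iterable[Tuple[int, int]],
                  mask_char: str = "X") -> str:
    """Replace residues inside the given segments with mask_char.

    Segments are (start, end) pairs, 0-based and half-open. This
    convention is an assumption of the sketch, not taken from the source.
    """
    residues = list(sequence)
    for start, end in segments:
        # Clip each segment to the sequence bounds before masking.
        for i in range(max(start, 0), min(end, len(residues))):
            residues[i] = mask_char
    return "".join(residues)


if __name__ == "__main__":
    # Toy sequence with one hypothetical predicted segment at positions 2..4.
    print(mask_segments("MKTAYIAKQR", [(2, 5)]))
```

Masking rather than deleting keeps residue numbering intact, so any hit reported by a subsequent search still maps to the original coordinates.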
Annotations of proteins in databases are generally written for a human reader and use a wide variety of terminology to describe phenomena in detail. Often, the user is interested in a more coarse-grained classification of proteins. The assignment of cellular localization (attributes "extracellular", "intracellular", and "membrane-related") is an example of such a problem. A solution, as exemplified by the META_A(annotator) software system, consists of a computer program that evaluates the annotation using systems of biological rules encoded in the form of regular expressions. Applied to the problem of subcellular localization, this approach made it possible to assign at least one of the three attributes to more than 88% of all SWISS-PROT entries. An application to the M. genitalium sequences revealed the probable absence of purely extracellular proteins in this organism.
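The idea of rules encoded as regular expressions can be sketched as follows. This is not the META_A(annotator) rule set, which is not given in the abstract; the patterns, the names `RULES`, `assign_localization`, and `coverage`, and the example annotations are all hypothetical and chosen only to show the mechanism.

```python
import re
from typing import Dict, Iterable, List, Set

# Hypothetical rules: each attribute is triggered by any of its patterns.
# A real rule system would be far larger and curated by biologists.
RULES: Dict[str, List[str]] = {
    "extracellular": [r"\bsecreted\b", r"\bextracellular\b"],
    "intracellular": [r"\bcytoplasm(ic)?\b", r"\bcytosol(ic)?\b",
                      r"\bintracellular\b"],
    "membrane-related": [r"\b(trans)?membrane\b"],
}

# Compile once so repeated evaluation over many entries stays cheap.
COMPILED = {
    attr: [re.compile(p, re.IGNORECASE) for p in patterns]
    for attr, patterns in RULES.items()
}


def assign_localization(annotation: str) -> Set[str]:
    """Return every attribute whose rules match the free-text annotation."""
    return {
        attr
        for attr, patterns in COMPILED.items()
        if any(p.search(annotation) for p in patterns)
    }


def coverage(annotations: Iterable[str]) -> float:
    """Fraction of annotations that receive at least one attribute."""
    entries = list(annotations)
    if not entries:
        return 0.0
    assigned = sum(1 for text in entries if assign_localization(text))
    return assigned / len(entries)


if __name__ == "__main__":
    examples = [
        "Integral membrane protein.",
        "Secreted.",
        "Hypothetical protein.",
    ]
    for text in examples:
        print(text, "->", sorted(assign_localization(text)))
    print("coverage:", coverage(examples))
```

Returning a set rather than a single label lets an entry carry more than one attribute, which matches the abstract's wording of assigning "at least one of the three attributes".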