Protein fold predictor based on global descriptors of amino acid sequence

Dubchak I., Muchnik I.1

E. O. Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA; Tel. (510)486-4338, Fax (510)486-6059, E-mail ildubchak@lbl.gov

1Center for Discrete Mathematics and Theoretical Computer Science, Rutgers University, Piscataway, NJ 08855-1179, USA

Predicting a protein fold and its implied function from the amino acid sequence is a problem of great interest. We have developed a neural network (NN)-based expert system which, given a classification of protein folds, can assign a protein to a folding class using primary sequence data. It addresses the inverse protein folding problem from a taxonomic rather than a threading perspective. Recent classifications suggest the existence of ~80-350 different folds. The occurrence of several representatives for each fold allows extraction of the common features of its members. Our method (i) provides a global description of a protein sequence in terms of the biochemical and structural properties of the constituent amino acids, (ii) combines the descriptors using NNs, allowing discrimination of members of a given folding class from members of all other folding classes, and (iii) uses a voting procedure among predictions based on different descriptors to decide on the final assignment. The level of generalization in this method is higher than in the direct sequence-sequence and sequence-structure comparison approaches. Two sequences belonging to the same folding class can differ significantly at the amino acid level, but the vectors of their global descriptors will be located close together in parameter space. Thus, utilizing these aggregate properties for fold recognition has an advantage over using detailed sequence comparisons.

All proteins in the non-redundant database of folds were transformed into inputs for the learning system in two steps:

(a) The sequence of amino acids was replaced by a sequence expressed in terms of their particular local physico-chemical or structural property, such as predicted secondary structure, predicted solvent accessibility, polarity, polarizability, van der Waals volume, and hydrophobicity;

(b) Three descriptors, "composition" (C), "transition" (T), and "distribution" (D), were calculated to describe, respectively, the global composition of a given local amino acid property in the protein, the frequencies with which the property changes along the entire length of the protein, and the distribution pattern of the property along the sequence. For each of the six properties, a parameter vector of 21 scalar components (C, T, and D combined) was constructed and used as an independent input to the NNs. The percent composition of amino acids was used as a seventh parameter set.
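Steps (a) and (b) can be illustrated with a minimal sketch. It rests on assumptions that are not stated in this abstract: each property is taken to split the residues into three classes, so that the 21 components arise as 3 (C) + 3 (T) + 15 (D), and D is taken as the chain positions, in percent of length, of the first, 25%, 50%, 75%, and last residue of each class. The three-class hydrophobicity grouping and all function names below are hypothetical and serve only as an example.

```python
import math
from typing import List

# ASSUMPTION: illustrative three-class hydrophobicity grouping; the abstract
# does not specify how residues are grouped for any property.
GROUPS = {
    "1": "RKEDQN",   # polar
    "2": "GASTPHY",  # neutral
    "3": "CLVIMFW",  # hydrophobic
}
AA_TO_CLASS = {aa: cls for cls, aas in GROUPS.items() for aa in aas}
CLASSES = "123"


def encode(sequence: str) -> str:
    """Step (a): replace each residue by its property class label.

    Raises KeyError for characters outside the 20 standard residues.
    """
    return "".join(AA_TO_CLASS[aa] for aa in sequence.upper())


def composition(encoded: str) -> List[float]:
    """C: percentage of residues in each class (3 values)."""
    n = len(encoded)
    return [100.0 * encoded.count(c) / n for c in CLASSES]


def transition(encoded: str) -> List[float]:
    """T: percentage of adjacent pairs that switch between two classes (3 values)."""
    n_pairs = max(len(encoded) - 1, 1)
    result = []
    for a, b in (("1", "2"), ("1", "3"), ("2", "3")):
        k = sum(1 for x, y in zip(encoded, encoded[1:]) if {x, y} == {a, b})
        result.append(100.0 * k / n_pairs)
    return result


def distribution(encoded: str) -> List[float]:
    """D: relative chain positions of the first, 25%, 50%, 75%, and last
    residue of each class (5 values per class, 15 in total)."""
    n = len(encoded)
    result = []
    for c in CLASSES:
        positions = [i + 1 for i, x in enumerate(encoded) if x == c]
        if not positions:
            result.extend([0.0] * 5)  # class absent from this sequence
            continue
        m = len(positions)
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            idx = max(0, math.ceil(frac * m) - 1)
            result.append(100.0 * positions[idx] / n)
    return result


def ctd_vector(sequence: str) -> List[float]:
    """Step (b): concatenate C, T, and D into one 21-component vector."""
    encoded = encode(sequence)
    return composition(encoded) + transition(encoded) + distribution(encoded)
```

Under these assumptions, `len(ctd_vector("RGCRGC"))` is 21 regardless of sequence length, which is what allows proteins of different sizes to be compared in one parameter space.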

To distinguish a particular fold from all other folds, seven NNs were trained, one for each of the seven parameter sets. Each query sequence thus received seven individual predictions, and a majority rule was used to make the final decision. This procedure is simple, efficient, and incorporated into easy-to-use software. It was applied to fold prediction in the context of two fine-grained classifications, 3D_ALI [1] and the Structural Classification of Proteins, SCOP [2]. In an attempt to simplify the fold recognition problem and to increase the reliability of predictions, we also approached a reduced fold recognition problem in which the choice is limited to two folds. Our prediction scheme demonstrated high accuracy in extensive testing on independent sets of proteins.
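The majority rule over seven predictions can be sketched as follows. This is only an illustration: the function name is hypothetical, and requiring a strict majority (and returning no assignment otherwise) is an assumption, since the abstract does not say how ties or split votes were resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(predictions: Sequence[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of predictors.

    ASSUMPTION: if no label receives more than half of the votes,
    no assignment is made and None is returned.
    """
    if not predictions:
        return None
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else None
```

For example, if four of the seven networks vote for the fold in question and three vote "other", the sequence is assigned to that fold. In the reduced two-fold problem, seven voters can never tie.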

  1. Pascarella, S., Argos, P. (1992). Prot. Engng., 5: 121-137.
  2. Murzin, A. G., Brenner, S. E., Hubbard, T., Chothia, C. (1995). J. Molec. Biol., 247: 536-540.