Galitsky B., Gelfand I., Kister A.
Mathematics Department, Rutgers University, Piscataway, NJ, 08854, USA; E-mail: galitsky@dimacs.rutgers.edu, igelfand@math.rutgers.edu, akister@math.rutgers.edu
Immunoglobulin (human heavy chain) sequences from the Kabat database are analyzed in terms of keywords (motifs) of small amino acid fragments (blocks). Representing the sequences as a combination of 17 fragment keywords reveals that 6 principal combinations describe the majority of sequences (60% exactly, 40% with deviations in 1-3 fragments). Furthermore, an exhaustive sequence classification is built that relates each sequence to a class, subclass and sub-subclass. The class is determined by the residues at three positions, and the subclass by the residues at 8 other positions. An important feature of this classification principle is that knowledge of a few keywords, or even of the residues at several key positions, allows one to predict the residue or residue type at almost any position of a sequence. The classification graph is drawn with three levels: class-determining nodes (strand E), subclass-determining nodes (strand A) and sub-subclass-determining nodes (loops). Edges link the first level with the second and the second with the third. The suggested classification is verified on a set of germline sequences. The keywords, which were obtained from the Kabat sequences, are found to fit the germline sequences to an even higher degree. Germline sequences are split into the same classes and subclasses as Kabat sequences. The corresponding classification graphs for germline and Kabat sequences are similar, except for extra sub-subclasses in the latter. It seems plausible that under natural sequence modification (somatic mutations) a sequence remains within the same class and subclass but could possibly change its sub-subclass. The purpose of this report is to predict the set of germline sequences from the Kabat sequences for various immunoglobulin families.
Comparison of the classification graphs for germline and Kabat sequences allowed us to define a formal procedure for transforming (simplifying) the latter graph into the former. For each subclass, all of its sub-subclass nodes are merged into one. The residues with the highest likelihood are assigned to the resulting nodes. (In other words, we choose the graph edges that represent the larger number of sequences.) The study included the following three stages. First, the prediction algorithm was developed for the human heavy chains. Second, the accuracy of the germline prediction was estimated for the kappa and lambda chains of human immunoglobulin, with respect both to the whole repertoire of reconstructed sequences and to individual sequence matches. Third, the germline prediction results for the other immunoglobulin families, for which no experimental data are available, are presented at our homepage.
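The graph simplification described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The position indices (`CLASS_POS`, `SUBCLASS_POS`, `LOOP_POS`), the function names and the toy sequences are hypothetical, since the actual key positions are not listed here. The sketch also assumes that "highest likelihood" means the most frequent residue at each sub-subclass position within a subclass.

```python
from collections import Counter, defaultdict

# Hypothetical 0-based key positions (assumption: the real class, subclass
# and sub-subclass positions are not given in this text).
CLASS_POS = (0, 1, 2)
SUBCLASS_POS = (3, 4)
LOOP_POS = (5, 6)


def key(seq, positions):
    """Return the residues of seq at the given positions as a tuple."""
    return tuple(seq[p] for p in positions)


def merge_sub_subclasses(sequences):
    """Collapse all sub-subclass nodes of each (class, subclass) pair into one
    node that holds the most frequent residue at every loop position."""
    counts = defaultdict(lambda: [Counter() for _ in LOOP_POS])
    for seq in sequences:
        node = (key(seq, CLASS_POS), key(seq, SUBCLASS_POS))
        for counter, p in zip(counts[node], LOOP_POS):
            counter[seq[p]] += 1
    return {
        node: tuple(c.most_common(1)[0][0] for c in counters)
        for node, counters in counts.items()
    }


# Toy input (made-up fragments, not Kabat data): three sequences share a
# class and subclass but differ at one loop position.
toy = ["QVQLVSG", "QVQLVSG", "QVQLVTG", "EVQLLSA"]
merged = merge_sub_subclasses(toy)
```

In this toy run the three sequences of the first subclass are reduced to a single predicted node, which plays the role of the germline sub-subclass in the simplified graph.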