Mean-recognition can increase genome annotation accuracy

Ponomarenko M.P., Ponomarenko J.V., Podkolodnaya O.A., Frolov A.S., Vorobyev D.V., Kolchanov N.A., Overton G.C.1

Institute of Cytology & Genetics, 630090, Novosibirsk, Russia; FAX: +7(3832)356-558; E-mail: pon@bionet.nsc.ru;

1UPenn, Philadelphia, USA;

Recognition of functional sites is a key step of genome annotation (Fickett, 1996). A number of recognition approaches have been developed so far; consensus and matrix approaches are still widely used (Gelfand, 1995). Recent evaluations of genome annotation accuracy have shown the need to increase recognition accuracy (Burset, 1996; Fickett, 1997). Here, we demonstrate how this may be achieved by averaging the recognitions:

$$F_N(S_{ab}) = \frac{1}{N} \sum_{k=1}^{N} f_k(S_{ab}),$$

where $S_{ab}$ is the DNA sequence of the region $[a;b]$; $f_k(S)$ is the $k$th normalized partial recognition; and $F_N$ is the mean-recognition of the $N$ partial recognitions. For mean-recognition, we created a generator of consensuses and frequency matrices for a given functional site, which describes the site in terms of overlapping oligomers from 1 bp to 6 bp in length defined in several alphabets. Applying this generator to 77 GATA-1 sites, we obtained all 15 significant consensuses and 18 matrices (i.e., the partial recognitions $f_{k,G}$). The distributions of the partial recognitions were plotted for the 77 GATA-1 sites, $P_G(f_{k,G})$, and for 1000 random DNA sequences, $P_R(f_{k,G})$. These distributions were typically non-Gaussian, asymmetric, multimodal, and considerably overlapping. This heterogeneity reflects the multistep character of DNA/GATA-1 binding and decreases the accuracy of the partial recognitions. The same distributions, $P_G(F_{N,G})$ and $P_R(F_{N,G})$, were then plotted for mean-recognitions calculated over various sets of consensuses and matrices. The results were as follows: (1) the distributions $P_R(F_{N,G})$ for random sequences became Gaussian at $N>3$; and (2) the distributions $P_G(F_{N,G})$ for the GATA-1 sites became symmetric at $N>10$ and Gaussian at $N>30$. Thus, the mean-recognitions became Gaussian-distributed as the total number $N$ of partial recognitions grew, as stated by the Central Limit Theorem. To check whether the Central Limit Theorem also holds for the YY1 site, we analyzed 27 sequences of this site and found that they follow the theorem as well. This suggests that any mean-recognition $F_N$ is likely to satisfy the Central Limit Theorem. Since, under this theorem, the variance of the mean-recognition decreases as the total number $N$ of partial recognitions grows, the accuracy consequently increases. Thus, mean-recognition offers a way to increase genome annotation accuracy.
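The averaging step described above can be illustrated with a minimal Python sketch. It is not the authors' generator: it uses only frequency matrices (no consensuses), only the four-letter alphabet, oligomers of 1 to 3 bp, and a handful of made-up aligned sequences rather than the 77 GATA-1 sites. The min-max rescaling of each score to [0, 1] is an assumed normalization, and the names `frequency_matrix`, `partial_recognition`, and `mean_recognition` are hypothetical.

```python
from collections import Counter
from itertools import product
import math


def oligomers(seq, k):
    """Return the overlapping oligomers of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def frequency_matrix(sites, k, pseudocount=1.0):
    """Position-specific k-mer frequencies over aligned, equal-length sites."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    total = len(sites) + pseudocount * len(words)
    matrix = []
    for pos in range(len(sites[0]) - k + 1):
        counts = Counter(site[pos:pos + k] for site in sites)
        matrix.append({w: (counts[w] + pseudocount) / total for w in words})
    return matrix


def partial_recognition(matrix, k, seq):
    """One partial recognition f_k: log-frequency score of seq under a matrix.

    Assumption: the score is rescaled to [0, 1] by the minimum and maximum
    scores the matrix can produce; the normalization used in the abstract
    is not specified here.
    """
    score = sum(math.log(col[w]) for col, w in zip(matrix, oligomers(seq, k)))
    lo = sum(math.log(min(col.values())) for col in matrix)
    hi = sum(math.log(max(col.values())) for col in matrix)
    return (score - lo) / (hi - lo)


def mean_recognition(matrices, seq):
    """F_N: arithmetic mean of the N partial recognitions of seq."""
    values = [partial_recognition(m, k, seq) for k, m in matrices]
    return sum(values) / len(values)


# Toy aligned sequences (illustrative only, not real annotated sites).
sites = ["AGATAA", "TGATAG", "AGATAG", "CGATAA", "AGATAT"]
matrices = [(k, frequency_matrix(sites, k)) for k in (1, 2, 3)]

site_score = mean_recognition(matrices, "AGATAA")        # site-like sequence
background_score = mean_recognition(matrices, "CCTGCC")  # unrelated sequence
print(site_score, background_score)
```

In this toy setting the site-like sequence receives a higher mean-recognition than the unrelated one. Adding further partial recognitions (longer oligomers, other alphabets, consensuses) only extends the `matrices` list; the averaging itself does not change.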

This work was supported by NIH grant 2-R01-RR04026-08A2, Russian Human Genome, and the Russian Basic Research Foundation.