Ramensky V.E., Makeev V.Ju., Tumanyan V.G.
Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, 117984, Moscow, Vavilov St. 32, Russia
We consider the problem of segmenting DNA into blocks of uniform nucleotide composition. Two main obstacles have to be overcome. First, it is a nontrivial task to define what constitutes a block in a DNA sequence: it is difficult to separate two DNA regions when both contain nucleotides of all four types. Second, for short blocks containing only a few letters, the frequency-count estimator of composition is highly sensitive to nucleotide substitutions. We argue that the second problem may be solved by using a Bayesian estimator of the composition. As the optimal segmentation we take the one for which the segmented sequence, that is, the set of blocks, has the highest probability of being generated by a series of independent trials with multinomial probabilities of nucleotide occurrence that are constant within each block. Our approach yields results consistent with the segmentation produced by complexity analysis of DNA for long sequences, while allowing short blocks to be obtained in a straightforward way.
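The two ideas above, a Bayesian composition estimator and a maximum-probability segmentation, can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the abstract: the symmetric Dirichlet prior with parameter `alpha`, the use of the block marginal likelihood as the score, the exhaustive O(n^2) dynamic programme, and the function names `bayes_composition`, `block_log_evidence` and `segment`. No prior on the number of block boundaries is included, so the sketch tends to over-segment regions of mixed composition.

```python
from collections import Counter
from math import lgamma

ALPHABET = "ACGT"  # assumed alphabet; ambiguity codes are not handled


def bayes_composition(block, alpha=1.0):
    """Posterior-mean composition under a symmetric Dirichlet(alpha) prior.

    Unlike raw frequency counts, the estimate never reaches 0 or 1,
    so a single substitution in a short block changes it only mildly.
    """
    counts = Counter(block)
    total = len(block) + len(ALPHABET) * alpha
    return {c: (counts[c] + alpha) / total for c in ALPHABET}


def block_log_evidence(counts, alpha=1.0):
    """Log probability of one block's letters, with the unknown multinomial
    probabilities integrated out against the Dirichlet(alpha) prior."""
    k, n = len(counts), sum(counts)
    out = lgamma(k * alpha) - lgamma(n + k * alpha)
    for c in counts:
        out += lgamma(c + alpha) - lgamma(alpha)
    return out


def segment(seq, alpha=1.0):
    """Return (blocks, log_probability) for the segmentation that maximises
    the product of block evidences, found by dynamic programming."""
    index = {c: i for i, c in enumerate(ALPHABET)}
    # prefix[j][t] = number of letters of type t among seq[:j]
    prefix = [[0] * len(ALPHABET)]
    for ch in seq:
        row = prefix[-1][:]
        row[index[ch]] += 1
        prefix.append(row)

    n = len(seq)
    best = [0.0] + [float("-inf")] * n  # best[j]: best log score for seq[:j]
    back = [0] * (n + 1)                # back[j]: start of the last block
    for j in range(1, n + 1):
        for i in range(j):
            counts = [prefix[j][t] - prefix[i][t] for t in range(len(ALPHABET))]
            score = best[i] + block_log_evidence(counts, alpha)
            if score > best[j]:
                best[j], back[j] = score, i

    # Walk the back-pointers to recover the blocks in left-to-right order.
    blocks, j = [], n
    while j > 0:
        blocks.append(seq[back[j]:j])
        j = back[j]
    blocks.reverse()
    return blocks, best[n]
```

With `alpha=1.0`, `bayes_composition("AAAA")` assigns 0.625 to A and 0.125 to each other letter, where the frequency-count estimator would give 1 and 0. Calling `segment("A" * 20 + "C" * 20)` returns the two homogeneous blocks.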
This work was supported by The Russian Human Genome Program.