Segmentation of chromosome-size DNA sequences into compositionally homogeneous domains using hidden Markov models

Peshkin L., Gelfand M.¹

Brown University, Providence, RI 02912, USA; E-mail: ldp@cs.brown.edu

¹Institute for Protein Research, 142292, Pushchino, Russia; E-mail: misha@imb.imb.ac.ru

This work presents an application of machine learning to characterizing an important property of natural DNA sequences: compositional inhomogeneity. Compositional segments often correspond to meaningful biological units. Taking such inhomogeneity into account is a prerequisite for successful recognition of functional features in DNA sequences, especially protein-coding genes.

Here we present a technique for DNA segmentation using hidden Markov models. A DNA sequence is represented by a chain of homogeneous segments, each described by one of a few statistically discriminated hidden states, whose contents form a first-order Markov chain.
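As an illustration only, the segmentation idea can be sketched with a toy two-state model decoded by the Viterbi algorithm. Everything concrete here is an assumption made for the example and is not taken from this work: the two states ("AT-rich" and "GC-rich"), their emission and transition probabilities, the simplification that each state emits nucleotides independently, and the helper names viterbi_segment and to_segments.

```python
import math

def viterbi_segment(seq, start, trans, emit):
    """Most probable hidden-state path for a non-empty seq (log-space Viterbi)."""
    n_states = len(start)
    # score[k]: best log-probability of any path that ends in state k
    score = [math.log(start[k]) + math.log(emit[k][seq[0]]) for k in range(n_states)]
    back = []  # back[t][k]: best predecessor of state k at position t + 1
    for ch in seq[1:]:
        prev, score, pointers = score, [], []
        for k in range(n_states):
            best = max(range(n_states), key=lambda j: prev[j] + math.log(trans[j][k]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans[best][k]) + math.log(emit[k][ch]))
        back.append(pointers)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

def to_segments(path):
    """Collapse a state path into (start, end_exclusive, state) runs."""
    segments, begin = [], 0
    for i in range(1, len(path) + 1):
        if i == len(path) or path[i] != path[begin]:
            segments.append((begin, i, path[begin]))
            begin = i
    return segments

# Hypothetical toy parameters: state 0 is AT-rich, state 1 is GC-rich.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
emit = [
    {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
]
seq = "ATATATATAT" + "GCGCGCGCGC"
segments = to_segments(viterbi_segment(seq, start, trans, emit))
print(segments)  # [(0, 10, 0), (10, 20, 1)]
```

In this toy case the decoded path splits the sequence into one AT-rich and one GC-rich domain. For chromosome-size input the same recursion applies, although the parameters would be estimated from the data rather than fixed by hand.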

The technique is used to describe and compare chromosomes I, III and IV of the completely sequenced Saccharomyces cerevisiae genome. Our results indicate the existence of a few well-separated states.

We also explore the model's likelihood landscape and analyze the dynamics of the optimization process, thus addressing the problem of reliability of the obtained optima and efficiency of the algorithms.
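One simple way to get a feel for a likelihood landscape is to evaluate the log-likelihood along a one-dimensional slice of parameter space. The sketch below is not the procedure used in this work; it assumes a hypothetical two-state model with fixed emissions and varies only the probability of staying in the same state, computing log P(sequence | model) with the scaled forward algorithm. The names log_likelihood, slice_over_stay and EMIT are invented for the example.

```python
import math

def log_likelihood(seq, start, trans, emit):
    """Scaled forward algorithm: log P(seq | model) for a non-empty seq."""
    n = len(start)
    alpha = [start[k] * emit[k][seq[0]] for k in range(n)]
    total = sum(alpha)
    loglik = math.log(total)
    alpha = [a / total for a in alpha]  # rescale to avoid underflow
    for ch in seq[1:]:
        alpha = [
            sum(alpha[j] * trans[j][k] for j in range(n)) * emit[k][ch]
            for k in range(n)
        ]
        total = sum(alpha)
        loglik += math.log(total)
        alpha = [a / total for a in alpha]
    return loglik

# Hypothetical fixed emissions: state 0 is AT-rich, state 1 is GC-rich.
EMIT = [
    {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
]

def slice_over_stay(seq, stay_values):
    """Log-likelihood as a function of the self-transition probability."""
    profile = {}
    for p in stay_values:
        trans = [[p, 1 - p], [1 - p, p]]
        profile[p] = log_likelihood(seq, [0.5, 0.5], trans, EMIT)
    return profile

seq = "AT" * 20 + "GC" * 20  # toy sequence with two homogeneous domains
profile = slice_over_stay(seq, [0.5, 0.7, 0.9, 0.99])
for p, ll in profile.items():
    print(f"stay={p:.2f}  logL={ll:.2f}")
```

With a stay probability of 0.5 the hidden states carry no memory and every nucleotide has probability 0.25, so that point serves as a baseline. Higher stay probabilities fit the two-domain toy sequence better. A full study of the landscape would vary all parameters and track the optimizer from many starting points.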