Consensus sampler for multiple DNA alignment

Lazareva B., Haussler D.

UCSC, SANTA Cruz, CA 95064, haussler@cse.ucsc.edu, FAX(408)459-4829

We solve a problem of finding the best consensus of non-specified length for a set of sequences assuming that the parameters of insertion, deletion and mismatch errors are fixed. Our algorithm performs MCMC sampling of consensus in the sequence space and finds the optimal and suboptimal consensus sequences. This setting is somewhat complimentary to the setting in profile HMM multiple alignment, where all the alignment parameters are learned from the data and the consensus length is fixed. The new algorithm, using MCMC but being inflexible in parameterization, is capable of exploring a wider set of consensus sequences without getting stuck in a local mode. This becomes crucial in the case when sequences are highly diverged and need many indels to be aligned to consensus.

The algorithm is shown to find the optimal consensus for sequences simulated with high mismatch and indel probabilities. On simulated data it appeared to be robust to the error rate parameters used in the algorithm. When the sequences were noisy (the error rate of about 0.1) consensus sampler was still able to restore the true consensus, while an HMM profile model trained with expectation maximization usually failed to do this. This is because if indels are a priori likely, a profile HMM model will get stuck in a local mode in which artificial, highly conserved match states are created by indels, and if indels are a priori unlikely, then an profile HMM model tends to misalign the sequences,and thus produce artifically dispersed distributions in match states, leading again to a local mode.

By allowing wild cards in the consensus, in particular 'N',the algorithm is capable of detecting conserved regions of the alignment. When tested on E.Coli promoters the algorithm found consensus for -35 and -10 regions and aligned sequences to those regions. For yeast introns the alignment produced conserved regions at branch point, 5' and 3' sites. Since indels do not play any role in these functional sites, the HMM alignment also succeeded in finding those conserved patterns.

We describe the method and illustrate it with examples.