Sinoquet Ch.
IRISA - Campus de Beaulieu, 35042 Rennes cedex France, E-mail: sinoquet@irisa.fr, phone number: 33 02 99 84 75 93, fax number: 33 02 99 84 71 71
The patterns we focus on are described as the repetition of one or more entities, each entity consisting in an unknown sub-pattern or a morphic transformation of it. Such a transformation sf (s in {+,}, f known morphism) is specified by (s = +, direct transf.) or (s = -, inverse transf.) (c is a character, w is a word.). Our method takes account of evolution impact on genome (mutations, indels). The tool we implemented automatically generates the sequence parser dedicated to a given specification. The complex pattern is specified by means of a discontinuous grammar, the symbols being implicitly separated by gaps, these uninteresting regions of the sequence. We allow ambiguity (alternative rules) and recursivity. The grammars class we deal with is able to capture either context-free aspects (inversion) or context-sensitive ones (repetition). For this purpose, beside terminal and non-terminal symbols, we introduce (global) string variables (SV). For example, an inversion is specified by , where S is the axiom, X a SV and id identity. The recognition problem is considered as the combination of a derivation from axiom and an instantiation of the entities in this derivation. Three challenges faced us : the first one consisted in processing with same efficiency direct and inverse transformations, whose complexities are different (Chomsky hierarchy). Then, in the concerned domain, the few parsers designed deal with perfect match or substitution errors. As a second challenge, we allow mutations and indels. The third challenge concerns the possibility to identify a pattern composed of (possibly composed) sub-patterns repeated in a finite but undetermined number of examplars. To reach the two first objectives, we preprocess the sequence. This step consists in maximal use of constraints mentionned in the specification grammar : semantical constraints (morphic transformations), string variables lengths, spatial local constraints (absolute position in sequence), spatial inter-variables constraints (relative position), content constraints (group(s) of known characters localized at known distance from a given variable). Preprocessing results in dictionaries compiling relevant sub-words of the sequence, that is, sub-words that may be involved in the instantiation of some . Such an operation is linear in time with sequence length when no error is allowed. It lays on the morphic transformation principle: = = = = (if s=+); = = = = (if s=). Model M' is an extension of model M ( with if s=+ (resp. if s=)). is an exact occurrence for model M' modulo sf. Our method does not scan gaps : it is adapted to discontinuous grammars. When we consider different maximal error thresholds e, there exists a range of sequence lengths for which both time and memory complexities for approximate occurrences lists construction remain reasonable. Absolute constraint filtering is performed during construction of dictionaries. Relative constraints are taken into account once all occurrences have been generated. In our applications (nucleotidic sequences), we searched for complex patterns based on recursivity, ambiguous patterns (alternative regions), tertiary structures (pseudo-knots) and clover-leaf structures. In this last example, we tested sequences lengths from 300 (e=6) to 30000 (e=1) (and even 300000 (e=0)), and observed that if e is sufficiently high (), considering only indels still allows us to localize a structure with mutations and indels: the multiple pattern instantiations generated refer to the same precise region of the sequence, with internal differences.