Sinoquet Ch.
IRISA - Campus de Beaulieu, 35042 Rennes cedex France, E-mail: sinoquet@irisa.fr, phone number: 33 02 99 84 75 93, fax number: 33 02 99 84 71 71
The patterns we focus on are described as the repetition of one or more
entities, each entity consisting in an unknown sub-pattern or a morphic
transformation of it. Such a transformation sf (s in {+,},
f known morphism) is specified by
(s = +, direct transf.) or
(s = -, inverse transf.) (c is a character, w is a word.).
Our method takes account of evolution impact on genome (mutations, indels).
The tool we implemented automatically generates the sequence parser dedicated
to a given specification. The complex pattern is specified by means of
a discontinuous grammar, the symbols being implicitly separated by gaps,
these uninteresting regions of the sequence. We allow ambiguity (alternative
rules) and recursivity. The grammars class we deal with is able to capture
either context-free aspects (inversion) or context-sensitive ones (repetition).
For this purpose, beside terminal and non-terminal symbols, we introduce
(global) string variables (SV). For example, an inversion is specified
by
, where S is the
axiom, X a SV and id identity. The recognition problem is
considered as the combination of a derivation from axiom and an instantiation
of the
entities in this derivation.
Three challenges faced us : the first one consisted in processing with
same efficiency direct and inverse transformations, whose complexities
are different (Chomsky hierarchy). Then, in the concerned domain, the few
parsers designed deal with perfect match or substitution errors. As a second
challenge, we allow mutations and indels. The third challenge concerns
the possibility to identify a pattern composed of (possibly composed) sub-patterns
repeated in a finite but undetermined number of examplars. To reach the
two first objectives, we preprocess the sequence. This step consists in
maximal use of constraints mentionned in the specification grammar : semantical
constraints (morphic transformations), string variables lengths, spatial
local constraints (absolute position in sequence), spatial inter-variables
constraints (relative position), content constraints (group(s) of known
characters localized at known distance from a given variable). Preprocessing
results in dictionaries compiling relevant sub-words of the sequence, that
is, sub-words that may be involved in the instantiation of some
.
Such an operation is linear in time with sequence length when no error
is allowed. It lays on the morphic transformation principle:
=
=
=
=
(if s=+);
=
=
=
=
(if s=). Model M'
is an extension of model M (
with
if s=+ (resp.
if s=)).
is an exact occurrence
for model M' modulo sf. Our method does not scan gaps : it is adapted
to discontinuous grammars. When we consider different maximal error thresholds
e, there exists a range of sequence lengths for which both time
and memory complexities for approximate occurrences lists construction
remain reasonable. Absolute constraint filtering is performed during construction
of dictionaries. Relative constraints are taken into account once all occurrences
have been generated. In our applications (nucleotidic sequences), we searched
for complex patterns based on recursivity, ambiguous patterns (alternative
regions), tertiary structures (pseudo-knots) and clover-leaf structures.
In this last example, we tested sequences lengths from 300 (e=6)
to 30000 (e=1) (and even 300000 (e=0)), and observed that
if e is sufficiently high (
),
considering only indels still allows us to localize a structure with mutations
and indels: the multiple pattern instantiations generated refer to the
same precise region of the sequence, with internal differences.