Journal of Molecular Biology
Regular article
Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles¹
Introduction
Protein motifs can be efficiently identified in sequence databases, primarily because sophisticated amino acid substitution models detect functional signatures even when sequence conservation is poor. In comparison, the field of RNA detection is still at a primitive stage. Nucleotide bases do not carry as much functional information as amino acid residues, and the structure (hence the function) of an RNA molecule is defined by distant interactions as well as by the linear sequence. Therefore, neither sophisticated substitution models nor the classical sequence alignment procedures can be applied to RNA detection.
These obstacles have been circumvented in different ways. First, specific programs have been developed for the detection of particular RNA molecules, such as tRNA [1, 2] or group I introns [3]. Since this approach obviously lacked flexibility, computer programs were developed that enable biologists to describe RNA motifs using a special language. Several descriptor languages and search engines have been devised [4, 5, 6, 7], allowing the specification of RNA elements such as helices and single strands, as well as sequence constraints. Although these programs are commonly used in RNA motif searches, their effectiveness strongly depends on our understanding of an RNA’s sequence/structure requirements. Subtle sequence constraints in helices or single strands are easily overlooked, causing insufficient specificity; conversely, constraints may be overtightened, causing the program to miss some unusual cases. A correct balance between specificity and sensitivity is better achieved using a statistical model of the RNA sequences under study. Programs based on Stochastic Context Free Grammars (SCFG) derive such a statistical model automatically from sequence data, in the form of sets of production rules and their associated probabilities [8, 9]. SCFGs have been successful in helping to identify new snoRNAs [10], but practical limitations (no support for pseudoknots, heavy computational demand) have restricted their use.
We present here an original approach to RNA signature detection that does not require writing descriptors, yet permits fast and accurate motif definition and identification. The program, named ERPIN (Easy RNA Profile IdentificatioN), is based on the principle of lod-score profiles generalized to base-paired regions. A sequence alignment and a secondary structure annotation are required as input. We will show applications of ERPIN to tRNA loops, Selenocysteine Insertion Elements and a protein-bound fragment of ribosomal RNA. An example of iterative search is also shown, using the Iron Response Element.
Section snippets
Secondary structure profiles and the ERPIN algorithm
The input of ERPIN is an RNA sequence alignment annotated with secondary structure information. A log-odds-score (lod-score) profile is constructed for each helix and single strand in the alignment. Helix profiles are 16-row matrices with a lod-score for each possible base-pair, while single strand profiles are generally five-row matrices with lod-scores for the four bases and the gap character (see the precise definitions of profiles in Materials and Methods). For example, the simplest
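As a rough illustration of the 16-row helix profiles described in this section, the following Python sketch builds one lod-score column per base-paired position. It is not ERPIN's code. The function name `helix_profile`, the pseudocount, the restriction to gap-free helices, and the choice of expected pair frequency (the product of two background base frequencies) are all assumptions made for the example; the precise definitions are those given in Materials and Methods.

```python
import math
from itertools import product

BASES = "ACGU"
# All 16 ordered base-pairs: one profile row per pair.
PAIRS = ["".join(p) for p in product(BASES, repeat=2)]


def helix_profile(strand5, strand3, base_freq, pseudocount=1.0):
    """Build a helix profile as a list of {pair: lod-score} columns.

    strand5 / strand3: aligned, gap-free 5' and 3' strands of one helix,
    one string per sequence. Position i of the 5' strand is assumed to
    pair with position (length - 1 - i) of the 3' strand.
    base_freq: assumed background frequency of each base.
    pseudocount: assumption of this sketch, avoids log(0) for unseen pairs.
    """
    n_seqs = len(strand5)
    length = len(strand5[0])
    total = n_seqs + pseudocount * len(PAIRS)
    profile = []
    for i in range(length):
        counts = dict.fromkeys(PAIRS, 0)
        for s5, s3 in zip(strand5, strand3):
            counts[s5[i] + s3[length - 1 - i]] += 1
        column = {}
        for pair in PAIRS:
            observed = (counts[pair] + pseudocount) / total
            # Assumed expectation: independent draw of the two bases.
            expected = base_freq[pair[0]] * base_freq[pair[1]]
            column[pair] = math.log(observed / expected)
        profile.append(column)
    return profile
```

Calling `helix_profile(["GGC", "GAC"], ["GCC", "GUC"], {b: 0.25 for b in "ACGU"})` returns three columns of 16 scores each, with pairs that recur in the alignment scoring above pairs that never appear.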
Conclusion
We have presented a practical program for the automatic derivation of an RNA signature from a sequence alignment and secondary structure. The signature takes the form of a statistical secondary structure profile, an adaptation of the well-known lod-score profile. An important advantage of statistical profiles is their ability to capture biases that escape human inspection. Such biases occur very frequently in single-stranded or base-paired regions. According to the rRNA base-pair frequency tables
Single-strand profiles
Gap-containing single strands are represented by a classical lod-score matrix with five scores per position (one for each base type, and one for gaps). The score for observing a given base at position i is $S_i = \log(O_i / E_i)$, where $O_i$ and $E_i$ are the observed and expected frequencies of this base at position i, respectively. Expected base frequencies are those in the target database, which compensates for possible compositional biases in the database. Contrary to usual sequence alignment procedures,
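The five-row single-strand profile can be sketched in a few lines of Python: each column stores the log of the observed over the expected frequency for the four bases and the gap character. This is only a minimal sketch under stated assumptions, not ERPIN's implementation. The names `single_strand_profile` and `score_window`, the pseudocount, and the expected frequency assigned to the gap symbol are hypothetical choices made here so that the example runs.

```python
import math
from collections import Counter

SYMBOLS = "ACGU-"  # four bases plus the gap character: five rows


def single_strand_profile(aligned, background, pseudocount=1.0):
    """Return one {symbol: lod-score} dict per alignment column.

    aligned: equal-length strings over SYMBOLS (the single-strand region).
    background: expected frequency of each symbol, e.g. measured in the
    target database; the value used for '-' is an assumption of this sketch.
    pseudocount: also an assumption, added so unseen symbols get a finite score.
    """
    n_seqs = len(aligned)
    total = n_seqs + pseudocount * len(SYMBOLS)
    profile = []
    for i in range(len(aligned[0])):
        counts = Counter(seq[i] for seq in aligned)
        column = {}
        for sym in SYMBOLS:
            observed = (counts[sym] + pseudocount) / total
            column[sym] = math.log(observed / background[sym])
        profile.append(column)
    return profile


def score_window(profile, window):
    """Sum the per-position lod-scores of a candidate window."""
    return sum(column[sym] for column, sym in zip(profile, window))
```

With a toy alignment such as `["GAAA", "GCAA", "GAAA"]` and a uniform background, a window matching the consensus (`"GAAA"`) receives a higher total score than an unrelated window (`"CUUU"`).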
Acknowledgements
We thank Benjamin Wainstain for stimulating discussions about the program.
References (24)
- et al. Identifying potential tRNA genes in genomic DNA sequences. J. Mol. Biol. (1991)
- et al. Automatic identification of group I intron cores in genomic DNA sequences. J. Mol. Biol. (1994)
- et al. A common motif organizes the structure of multi-helix loops in 16 S and 23 S ribosomal RNAs. J. Mol. Biol. (1998)
- et al. Singly and bifurcated hydrogen-bonded base-pairs in tRNA anticodon hairpins and ribozymes. J. Mol. Biol. (1999)
- et al. New mammalian selenocysteine-containing proteins identified with an algorithm that searches for selenocysteine insertion sequence elements. J. Biol. Chem. (1999)
- et al. Novel selenoproteins identified in silico and in vivo based on an RNA structural tag. J. Biol. Chem. (1999)
- et al. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucl. Acids Res. (1997)
- et al. Pattern searching/alignment with RNA primary and secondary structures: an effective descriptor for tRNA. Comp. Appl. Biosci. (1990)
- et al. An RNA pattern matching program with enhanced performances and portability. Comp. Appl. Biosci. (1994)
- et al. Palingol: a declarative programming language to describe nucleic acids’ secondary structures and to scan sequence database. Nucl. Acids Res. (1996)
- PatSearch: a pattern matcher software that finds functional elements in nucleotide and protein sequences and assesses their statistical significance. Bioinformatics
- Stochastic context-free grammars for tRNA modeling. Nucl. Acids Res.
¹ Edited by J. Doudna