Journal of Molecular Biology
Volume 313, Issue 5, 9 November 2001, Pages 1003-1011

Regular article
Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles1

https://doi.org/10.1006/jmbi.2001.5102

Abstract

We present here a new approach to the problem of defining RNA signatures and finding their occurrences in sequence databases. The proposed method is based on “secondary structure profiles”. An RNA sequence alignment with secondary structure information is used as input. Two types of weight matrices/profiles are constructed from this alignment: single strands are represented by a classical lod-score profile, while helical regions are represented by an extended “helical profile” comprising 16 lod-scores per position, one for each of the 16 possible base-pairs. Database searches are then conducted using a simultaneous search for helical profiles and dynamic programming alignment of single-strand profiles. The algorithm has been implemented in a new program, ERPIN, that performs both profile construction and database search. Applications are presented for several RNA motifs. The automated use of sequence information in both single-stranded and helical regions yields better sensitivity/specificity ratios than descriptor-based programs. Furthermore, since the translation of alignments into profiles is straightforward with ERPIN, iterative searches can easily be conducted to enrich collections of homologous RNAs.

Introduction

Protein motifs can be efficiently identified in sequence databases, primarily thanks to the development of sophisticated amino acid substitution models that detect functional signatures even when sequence conservation is poor. In comparison, the field of RNA detection remains at a primitive stage. Nucleotide bases do not carry as much functional information as amino acid residues, and the structure (hence the function) of an RNA molecule is defined by distant interactions as well as by the linear sequence. Therefore, neither sophisticated substitution models nor the classical sequence alignment procedures can be applied to RNA detection. These obstacles have been circumvented in different ways. First, specific programs have been developed for the detection of particular RNA molecules, such as tRNA [1, 2] or group I introns [3]. Since this approach obviously lacked flexibility, computer programs were developed enabling biologists to describe RNA motifs using a special language. Several descriptor languages and search engines have been devised [4, 5, 6, 7], allowing the specification of RNA elements such as helices and single strands, as well as sequence constraints. Although these programs are commonly used in RNA motif searches, their effectiveness strongly depends on our understanding of an RNA’s sequence/structure requirements. Subtle sequence constraints in helices or single strands are easily overlooked, causing insufficient specificity; conversely, constraints may be overtightened, causing the program to miss unusual cases. A correct balance between specificity and sensitivity is better achieved using a statistical model of the RNA sequences under study. Programs based on Stochastic Context Free Grammars (SCFG) derive such a statistical model automatically from sequence data, in the form of sets of production rules and their associated probabilities [8, 9]. SCFG have been successful in helping to identify new snoRNAs [10], but practical limitations (no support for pseudoknots, heavy computational demand) have restricted their use.

We present here an original approach to RNA signature detection that does not require writing descriptors, and yet permits fast and accurate motif definition and identification. The program, named ERPIN (Easy RNA Profile IdentificatioN), is based on the principle of lod-score profiles generalized to base-paired regions. A sequence alignment with secondary structure annotation is required as input. We will show applications of ERPIN to tRNA loops, Selenocysteine Insertion Elements and a protein-bound fragment of ribosomal RNA. An example of iterative search is also shown, using the Iron Response Element.

Section snippets

Secondary structure profiles and the ERPIN algorithm

The input of ERPIN is an RNA sequence alignment annotated with secondary structure information. A log-odds-score (lod-score) profile is constructed for each helix and single strand in the alignment. Helix profiles are 16-row matrices with a lod-score for each possible base-pair, while single strand profiles are generally five-row matrices with lod-scores for the four bases and the gap character (see the precise definitions of profiles in Materials and Methods). For example, the simplest
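The helix profile described above can be illustrated with a minimal Python sketch. This is not ERPIN's code: the function names, the pseudocount, the toy alignment, and the choice of expected pair frequency (product of the two background base frequencies) are all assumptions made for illustration, and the helix strands are assumed to be gap-free.

```python
import math
from itertools import product

BASES = "ACGU"
PAIRS = [a + b for a, b in product(BASES, repeat=2)]  # the 16 possible base-pairs

def helix_profile(strand5, strand3, base_freq, pseudocount=1.0):
    """Build one {pair: lod-score} dict per helix position.

    strand5, strand3 -- aligned, gap-free 5' and 3' strands (one string per sequence);
                        position j of the 5' strand pairs with position -1-j of the 3' strand
    base_freq        -- background frequency of each base (assumed supplied by the caller)
    pseudocount      -- assumption: keeps unseen pairs at a finite score
    """
    n_seq = len(strand5)
    length = len(strand5[0])
    profile = []
    for j in range(length):
        seen = [s5[j] + s3[length - 1 - j] for s5, s3 in zip(strand5, strand3)]
        scores = {}
        for pair in PAIRS:
            observed = (seen.count(pair) + pseudocount) / (n_seq + pseudocount * len(PAIRS))
            # Assumption: expected pair frequency is the product of the base frequencies.
            expected = base_freq[pair[0]] * base_freq[pair[1]]
            scores[pair] = math.log(observed / expected)
        profile.append(scores)
    return profile

def helix_score(profile, cand5, cand3):
    """Sum the lod-scores of the base-pairs formed by two candidate strands."""
    length = len(profile)
    return sum(profile[j][cand5[j] + cand3[length - 1 - j]] for j in range(length))

# Toy three-sequence helix, invented for illustration.
strand5 = ["GGC", "GAC", "GGC"]
strand3 = ["GCC", "GUC", "GCC"]
uniform = {b: 0.25 for b in BASES}
profile = helix_profile(strand5, strand3, uniform)
```

With this toy input, a candidate whose strands form the pairs seen in the alignment scores higher under `helix_score` than one whose strands cannot pair, which is the behaviour a 16-score-per-position profile is meant to capture.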

Conclusion

We have presented a practical program for the automatic derivation of an RNA signature from a sequence alignment and secondary structure. The signature takes the form of a statistical secondary structure profile, an adaptation of the well-known lod-score profile. An important advantage of statistical profiles is their ability to capture biases that escape human inspection. Such biases occur very frequently in single-stranded or base-paired regions. According to the rRNA base-pair frequency tables

Single-strand profiles

Gap-containing single strands are represented by a classical lod-score matrix with five scores per position (one for each base type, and one for gaps). The score for observing a given base at position i is:

$$S_i = \log \frac{O_i}{E_i}$$

where $O_i$ and $E_i$ are the observed and expected frequencies of this base at position i, respectively. Expected base frequencies are those in the target database, which compensates for possible compositional biases in the database. Contrary to usual sequence alignment procedures,
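As a minimal sketch of the single-strand lod-score defined above, the following Python computes five scores per alignment column. It is an illustration under stated assumptions, not ERPIN's implementation: the function names, the pseudocount, the natural logarithm, the toy alignment, and in particular the expected frequency assigned to the gap character are all hypothetical choices.

```python
import math

ALPHABET = "ACGU-"  # four bases plus the gap character: five scores per position

def single_strand_profile(aligned, expected, pseudocount=1.0):
    """Return one {symbol: lod-score} dict per alignment column.

    aligned     -- equal-length aligned strings over ALPHABET
    expected    -- expected frequency E of each symbol (assumed supplied by the caller)
    pseudocount -- assumption: added to every count so unseen symbols stay finite
    """
    n_seq = len(aligned)
    profile = []
    for i in range(len(aligned[0])):
        column = [seq[i] for seq in aligned]
        scores = {}
        for sym in ALPHABET:
            observed = (column.count(sym) + pseudocount) / (n_seq + pseudocount * len(ALPHABET))
            scores[sym] = math.log(observed / expected[sym])  # S_i = log(O_i / E_i)
        profile.append(scores)
    return profile

def strand_score(profile, candidate):
    """Sum per-position lod-scores for a candidate of the same length (no realignment)."""
    return sum(col[sym] for col, sym in zip(profile, candidate))

# Toy alignment and background frequencies, invented for illustration.
alignment = ["GAAA", "GCAA", "G-AA", "GAAA"]
background = {"A": 0.24, "C": 0.24, "G": 0.24, "U": 0.24, "-": 0.04}
profile = single_strand_profile(alignment, background)
```

In a real search the `background` values would be taken from the target database, as the text states; the gap entry here is only a placeholder.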

Acknowledgements

We thank Benjamin Wainstain for stimulating discussions about the program.

References (24)

  • G. Pesole et al. PatSearch: a pattern matcher software that finds functional elements in nucleotide and protein sequences and assesses their statistical significance. Bioinformatics (2000)
  • Y. Sakakibara et al. Stochastic context-free grammars for tRNA modeling. Nucl. Acids Res. (1994)
1 Edited by J. Doudna
