Computational identification of structured cis-regulatory elements in the 3’UTRs of human protein-coding mRNAs

Xiaowei Sylvia Chen and Chris M Brown

Contact: xiaowei.chen @ otago.ac.nz

Supporting data for XS Chen, CM Brown (2012) Computational identification of new structured cis-regulatory elements in the 3’-UTR of human protein coding mRNAs. Nucleic Acids Research 40:8862-8873

Messenger RNAs contain a large number of cis-regulatory RNA elements that function in many types of post-transcriptional regulation. These cis-regulatory elements are often characterised by conserved structures and/or sequences. Although some classes are well known, given the wide range of RNA-interacting proteins in eukaryotes, it is likely that many new classes of cis-regulatory elements are yet to be discovered.

One approach is to use computational methods, which can analyse genomic data, particularly comparative data, on a large scale. In this study a set of structural discovery algorithms was applied, followed by support vector machine classification. We trained a new classification model (CisRNA-SVM) on a set of known structured cis-regulatory elements from 3'UTRs and successfully distinguished these, and groups of cis-regulatory elements that the model had not been trained on, from control genomic and shuffled sequences.

The new method outperformed previous methods in classification of cis-regulatory RNA elements. This model was then used to predict new elements from cross-species conserved regions of human 3’UTRs. Clustering of these elements identified new classes of potential cis-regulatory elements.

The model, training and testing sets, and novel human predictions are available at: http://mRNA.otago.ac.nz/CisRNA-SVM.

 

CisRNA-SVM

All files and example (download)

------

Help-

perl CisRNA_SVM.pl -h

Input: -s filename - file containing a list of scores: RNAalifold (score), RNAz 2.0 (structural conservation index, SCI), RSmatch (average score), FoldalignM (normalized score), CMfinder (average score), LocARNA (score). Each line starts with alignment_name.start.end, followed by the scores separated by tabs or spaces (e.g. IRE_test_score.txt)

-d directory where the executable svm-predict (from libsvm 2.9) is located e.g. /usr/local/bin

-a directory containing the alignments, in ClustalW format. Filenames are alignment_name.extension, and alignment_name must contain no '.' (e.g. /Users/myname/Documents/aln)

-m filename - svm model e.g. CisRNA_SVM.model

Note: this requires the Tie::IxHash perl module and svm-predict

Output: predicted structured alignments are written to a directory (result_alignment); scores are written to a file (svm_prediction.txt) and to stdout

------

Example of its use:

perl CisRNA_SVM.pl -s IRE_test_score.txt -m CisRNA_SVM.model -a aln -d /home/user/bin > IRE.log

example output: svm_prediction.txt, resultant alignment files.
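Before running the script it can help to check that a score file matches the layout described in the help text. The following is a minimal Python sketch, not part of CisRNA-SVM itself: the function name, the column labels and the example line are all hypothetical, and it assumes only what the help text states (an alignment_name.start.end identifier followed by six whitespace-separated scores in the listed order).

```python
from typing import NamedTuple

# Column labels are illustrative; the order follows the help text above.
SCORE_NAMES = ("RNAalifold", "RNAz_SCI", "RSmatch", "FoldalignM", "CMfinder", "LocARNA")


class ScoreRow(NamedTuple):
    alignment_name: str
    start: int
    end: int
    scores: dict


def parse_score_line(line: str) -> ScoreRow:
    """Parse one line: alignment_name.start.end followed by six scores."""
    fields = line.split()  # splits on tabs or spaces
    expected = 1 + len(SCORE_NAMES)
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    # alignment_name contains no '.', so the identifier has exactly three parts
    parts = fields[0].split(".")
    if len(parts) != 3:
        raise ValueError(f"identifier must be alignment_name.start.end: {fields[0]!r}")
    name, start, end = parts[0], int(parts[1]), int(parts[2])
    values = [float(x) for x in fields[1:]]
    return ScoreRow(name, start, end, dict(zip(SCORE_NAMES, values)))


# Made-up example line, for illustration only (not taken from IRE_test_score.txt).
row = parse_score_line("IRE_example.1.120\t-20.5\t0.9\t10.0\t0.5\t12.0\t300")
print(row.alignment_name, row.start, row.end, row.scores["RNAz_SCI"])
```

A line that fails either check raises ValueError, which makes it easy to locate a malformed entry before it reaches svm-predict.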

Results

The elements below were drawn from 154,803 subalignments, taken as 120 nt overlapping (40 nt) windows from 17,821 3'UTR alignments (file) selected from those previously used by TargetScan for miRNA target site prediction. These alignments were extracted from the UCSC 28-way vertebrate genome multiz alignments by Friedman et al. 2009. For this study ten diverse vertebrate species were chosen for inclusion in the alignments (Human, Mouse, Dog, Horse, Cow, Opossum, Platypus, Lizard, Chicken, Frog).
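The windowing step can be illustrated with a short Python sketch. This is not the code used in the study. It assumes that the 40 nt figure is the overlap between consecutive 120 nt windows (a step of 80 columns); if 40 nt is instead the step size, pass overlap=80. The handling of alignments shorter than one window, and of a trailing remainder, is also an assumption made for illustration.

```python
def window_coords(aln_length, window=120, overlap=40):
    """Return half-open (start, end) column ranges for sliding windows.

    Assumptions (not confirmed by the source): `overlap` is the number of
    columns shared by consecutive windows; an alignment shorter than one
    window yields a single range; a trailing remainder is covered by one
    extra window placed flush with the end of the alignment.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if aln_length <= window:
        return [(0, aln_length)]
    step = window - overlap
    starts = list(range(0, aln_length - window + 1, step))
    if starts[-1] + window < aln_length:
        starts.append(aln_length - window)
    return [(s, s + window) for s in starts]


print(window_coords(280))
```

For a 280-column alignment this yields three windows starting at columns 0, 80 and 160.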

Set 1. Putative structured human 3'UTR cis-regulatory elements. High confidence set (Pp > 0.9), 4,424 elements, estimated false positive rate 0.0% (see paper)

Set 2. Putative structured human 3'UTR cis-regulatory elements. Medium confidence set (Pp > 0.5), 22,038 elements, estimated false positive rate 6.3% (see paper)
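The two sets are defined by thresholds on Pp. As a rough Python illustration of that kind of filtering, the sketch below applies each threshold independently to a list of (identifier, Pp) pairs. The identifiers and Pp values are invented, the function name is hypothetical, and the layout of svm_prediction.txt is not assumed here. Whether the published Set 2 includes the Set 1 elements is not stated on this page; in this sketch the Pp > 0.5 selection necessarily contains the Pp > 0.9 selection.

```python
def select_by_probability(predictions, threshold):
    """Return identifiers whose Pp is strictly greater than `threshold`."""
    return [name for name, pp in predictions if pp > threshold]


# Invented values, for illustration only.
predictions = [
    ("utrA.1.120", 0.97),
    ("utrA.81.200", 0.62),
    ("utrB.1.120", 0.31),
]

high_confidence = select_by_probability(predictions, 0.9)
medium_confidence = select_by_probability(predictions, 0.5)
print(high_confidence)
print(medium_confidence)
```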

For comparison we provide a track of the coordinates of the Rfam_scan matches to Rfam 10.1 models on version hg19 of the human genome (Rfam10.1), a track of EvoFold predictions lifted over from hg17 by UCSC (Evofold), and a track of RNAz 1.0 predictions lifted over from hg17 by ncRNA (RNAz). Intersections between these are also provided (Evofold_Rfam, EvoFold_Rfam_CisRNA-SVM).

Training dataset

Training positive dataset: CisRegRNA Set A. Training negative dataset: genomic alignments. Testing positive dataset: CisRegRNA Set B. Testing negative dataset: shuffled sequences.

Last update 3/3/2013 by CMB.
