Chapter 1
Methods for Selecting Effective siRNA Sequences by Using Statistical and Clustering Techniques
Shigeru Takasaki
Abstract
Short interfering RNAs (siRNAs) have been widely used for studying gene functions in mammalian cells but vary markedly in their gene-silencing efficacy. Although many design rules/guidelines for effective siRNAs based on various criteria have been reported recently, there are only a few consistencies among them. This makes it difficult to select effective siRNA sequences targeting mammalian genes. This chapter first reviews the reported siRNA design guidelines and clarifies the problems concerning the current guidelines. It then describes recently reported scoring methods for selecting effective siRNA sequences by using statistics and clustering techniques such as the self-organizing map (SOM) and the radial basis function (RBF) network. In the three proposed methods, individual scores are defined as a gene degradation measure based on position-specific statistical significances. The effectiveness of the methods was confirmed by evaluating effective and ineffective siRNAs for recently reported genes and by comparison with other reported scoring methods. The sizes (values) of these scores are closely correlated with the degree of gene degradation, and the scores can easily be used for selecting high-potential siRNA candidates. The evaluation results indicate that the proposed methods are useful for selecting siRNA sequences targeting mammalian mRNA sequences.
Key words: siRNA design, RNA interference, gene silencing, SOM classification, statistical significance, RBF network.
1. Introduction
Although RNA interference (RNAi) has been successfully used for studying gene functions in both plants and invertebrates, many practical obstacles need to be overcome before it becomes an established tool for use in mammalian systems (1–6). One of the important problems is designing effective short interfering RNA (siRNA) sequences for target genes. The siRNA responsible for RNA interference varies markedly in its gene-silencing efficacy in mammalian genes, where the gene-silencing effectiveness depends very much on the target sequence positions (sites) selected from the target gene (7, 8). Since different siRNAs synthesized for various positions induce different levels of gene silencing, the selection of the target sequence is critical for the effectiveness of the siRNA. We therefore need useful criteria for gene-silencing efficacy when we design siRNA sequences (9, 10).
M. Sioud (ed.), Methods in Molecular Biology, siRNA and miRNA Gene Silencing, vol. 487. Humana Press, a part of Springer Science + Business Media, LLC 2009. DOI: 10.1007/978-1-60327-547-7_1
Zamore et al. and Jayasena et al. showed that the 5′ end of the antisense strand might be incorporated into the RNA-induced silencing complex (RISC). Strand incorporation may depend on weaker base pairing, and thus an A–T terminus may lead to more strand incorporation than a G–C terminus (11, 12). Other factors reported to be related to gene-silencing efficacy are GC content, point-specific nucleotides, specific motif sequences, and secondary structures of mRNA. Several siRNA design rules/guidelines using efficacy-related factors have been reported (13–17).
Although the positional nucleotide characteristics for siRNA designs seem to be the most important factor determining effective siRNA sequences, there are few consistencies among the reported rules/guidelines (18–23). This implies that these rules/guidelines might result in the generation of many candidates and thus make it difficult to extract a few for synthesizing siRNAs. Furthermore, RNAi carries a risk of off-target regulation: a possibility that the siRNA will silence other genes whose sequences are similar to that of the target gene. When we use gene silencing for studying gene functions, we first have to select high-potential siRNA candidate sequences and then eliminate possible off-target ones (24).
This chapter first reviews the reported siRNA design guidelines and clarifies the problems concerning the reported guidelines. It then describes recently reported scoring methods for selecting effective siRNA sequences by using statistical and clustering techniques (25–32).
In the statistical method, many effective siRNA sequences reported in the literature are examined (31), because it can be hypothesized that position-specific nucleotides play important roles in gene-silencing efficacy. If specific features of nucleotide frequencies appear in many effective siRNAs, they indicate the positional nucleotide characteristics for siRNA designs. The features of nucleotide frequencies at individual positions are then analyzed by using a statistical significance test. As these features can be considered as new guidelines, a measure (score) for selecting effective siRNA candidates is defined based on the positional features of specific significant nucleotides. The effectiveness of the proposed measure was confirmed by comparing the computed scores with those of other recently reported selection methods (28, 29, 31).
The chapter then describes how to extract individual nucleotide features from many effective siRNA sequences by using mathematical clustering techniques: the SOM and the RBF network (see later Sects. 3.2.1 and 3.2.2) (25–27). In the SOM-based clustering method, siRNA classification from many effective siRNAs is first described. It is then shown how positional nucleotide features are extracted from the classified groups and how the extracted features are integrated as a measure (score). Finally, the SOM method is confirmed to be effective by evaluating the relations between the scores and the effective/ineffective siRNAs reported in the literature and comparing them with those of other reported scoring methods (30, 33).
In the RBF-network-based method, after the siRNA classification is carried out by using the RBF network (25, 26), the preferred and unpreferred nucleotides for effective siRNAs at individual positions are determined by significance testing and are used to calculate a score that measures a sequence’s potential for gene degradation. The effectiveness of the proposed scoring method was then confirmed by using it to evaluate RNA sequences recently reported to effectively or ineffectively suppress the expression of various genes (see later subsection) and comparing it with other scoring methods (32, 33).
As a result of various evaluations, good correlations were found between the sizes (values) of the proposed individual scores and the effectiveness or ineffectiveness of the recently reported siRNA sequences. The evaluation results indicate that the three methods would be useful for selecting siRNA sequences for mammalian genes.
2. RNA Interference and siRNA Sequence Selection Problem
2.1. RNA Interference
RNA interference (RNAi) is a phenomenon that silences gene expression by introducing double-stranded RNA (dsRNA) homologous to the target mRNA (1). After this phenomenon was discovered in the nematode Caenorhabditis elegans, it gradually became clear that similar phenomena occur in the cells of plants, fungi, and mammals (1–6). RNAi has been reported to result from the following sequence of events (2, 5, 6). Long dsRNA is first cleaved into siRNA species by an RNase III enzyme, Dicer. These siRNAs are then incorporated into an RNA-induced silencing complex (RISC), where the duplex siRNA is unwound so the antisense strand can guide RISC to the target mRNA having the complementary sequence. Finally, the target mRNA is cleaved at a single site in the center of the duplex region between the guide siRNA and the target mRNA (28). Among the events that are
still unclear, however, are the mechanism of the target mRNA cleavage and the mechanism by which the center of the duplex region in the RISC is identified. Furthermore, although RNAi has been widely used for studying gene functions, the effectiveness with which the genes in mammalian cells can be silenced this way depends very much on the target sequence positions (sites) selected from the target gene. That is, different siRNAs synthesized from various positions induce different levels of gene silencing. This indicates that the selection of the target sequence position (site) is critical for the effectiveness of the siRNA (3–6).
2.2. siRNA Sequence Selection Problem
2.2.1. Related Works Regarding the siRNA Sequence Selection Problem
To use RNAi as a biological tool for mammalian cell experiments, we first need to identify target sequences causing gene degradation. They have so far been identified by using a trial-and-error method (3, 8), but siRNAs extracted from different regions of the same gene have varied remarkably in their effectiveness. The difficulty of using the trial-and-error method to select target sequences causing gene silencing increases when the coding regions are long, as they are in mammalian cells. This is because the larger the number of candidates becomes, the more difficult it is to get gene-silencing candidates.
2.2.2. The Reported Guidelines for siRNA Sequence Design
The earliest guidelines for siRNA sequence design were proposed by Elbashir et al. (4, 8, 40). They suggested that synthesizing siRNA duplexes of 21 nucleotides (nt) length (a 19-nt base-paired sequence with 2-nt 3′ overhangs at the ends) mediates efficient cleavage of the target mRNA. Their rules are summarized as follows.
1. Select the target region from the open reading frame (ORF) of a given cDNA sequence, preferably 50–100 nt downstream of the start codon. Avoid 5′ or 3′ untranslated regions (UTRs) or regions close to the start codon, as these may be richer in regulatory protein-binding sites.
2. Search for sequences 5′-AA(N19)UU, where N is any nucleotide, in the mRNA sequence and choose those with approximately 50% GC content. Highly G-rich sequences should be avoided because they tend to form G-quartet structures. If there are no 5′-AA(N19)UU motifs present in the target mRNA, search for 5′-AA(N21) or 5′-NA(N21), and synthesize the sense siRNA as 5′-(N19)TT and the antisense siRNA as 5′-(N′19)TT, where N′19 denotes the reverse complement sequence of N19 and T indicates 2′-deoxythymidine.
3. Blast-search the selected siRNA sequences against EST libraries or mRNA sequences of the respective organism to ensure that only a single gene is targeted.
4. It may be advisable to synthesize several siRNA duplexes to control for the specificity of the knockdown experiments; those siRNA duplexes that are effective for silencing should produce exactly the same phenotype. Furthermore, a nonspecific siRNA duplex may be needed as a control.
5. If the siRNA does not work, first verify that the target sequence and the cell line used are derived from the same organism. Finally, make sure that the mRNA sequence used for selection of siRNA duplexes is reliable; it could contain sequencing errors, mutations, or polymorphisms.
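The motif search in the second rule can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the input is taken to be the sense strand in cDNA letters (so the UU of the rule appears as TT), and the function names and the start_codon_pos and min_offset parameters are hypothetical, not part of the published guidelines.

```python
import re

# Complement table for RNA letters (used to build the antisense strand).
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_aa_n19_tt(cdna: str, start_codon_pos: int = 0, min_offset: int = 50):
    """Yield (position, 19-nt core, GC fraction) for each 5'-AA(N19)TT site.

    start_codon_pos and min_offset are assumed parameters: sites that start
    fewer than min_offset nt downstream of the start codon are skipped.
    """
    cdna = cdna.upper()
    # A lookahead is used so that overlapping candidate sites are all found.
    for m in re.finditer(r"(?=AA([ACGT]{19})TT)", cdna):
        if m.start() - start_codon_pos < min_offset:
            continue
        core = m.group(1)
        yield m.start(), core, gc_content(core)


def duplex_strands(core: str):
    """Return (sense, antisense) strands as 5'-(N19)TT and 5'-(N'19)TT."""
    rna = core.upper().replace("T", "U")
    antisense = rna.translate(_COMPLEMENT)[::-1]  # reverse complement
    # The TT overhang stands for the two 2'-deoxythymidines of the rule.
    return rna + "TT", antisense + "TT"
```

Candidates returned by find_aa_n19_tt could then be filtered for a GC fraction near 0.5 before the specificity search of the third rule; the exact window used for that filter would be a further assumption.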
After that, many siRNA design guidelines/rules were reported, as follows. Reynolds et al. systematically analyzed 180 siRNAs targeting every other position of two 197-base regions of firefly luciferase and human cyclophilin B mRNA (90 siRNAs per gene) and reported the following eight criteria for improving siRNA selection (18).
G1 (Reynolds et al., eight criteria): (1) GC content 30–52%, (2) at least 3 As or Ts at positions 15–19, (3) absence of internal repeats, (4) an A at position 19, (5) an A at position 3, (6) a T at position 10, (7) a base other than G or C at position 19, and (8) a base other than G at position 13.
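As an illustration, the sequence-based criteria of G1 can be checked mechanically. The Python sketch below counts how many of criteria 1, 2, and 4–8 a 19-nt sense sequence satisfies; criterion 3 (absence of internal repeats) is left out because it requires a secondary-structure estimate. The function name and the plain count of satisfied criteria are assumptions of this sketch, not a scoring scheme taken from the text.

```python
def reynolds_criteria_met(sense: str) -> int:
    """Count the G1 criteria (1, 2, 4-8) met by a 19-nt sense sequence.

    Positions are counted 1..19 from the 5' end of the sense strand in
    cDNA form; U is accepted and treated as T.
    """
    s = sense.upper().replace("U", "T")
    if len(s) != 19 or set(s) - set("ACGT"):
        raise ValueError("expected a 19-nt sequence over A, C, G, T/U")

    def at(pos: int) -> str:
        """Return the base at a 1-based position."""
        return s[pos - 1]

    gc = (s.count("G") + s.count("C")) / 19
    checks = [
        0.30 <= gc <= 0.52,                     # (1) GC content 30-52%
        sum(b in "AT" for b in s[14:19]) >= 3,  # (2) >= 3 A/T at positions 15-19
        at(19) == "A",                          # (4) A at position 19
        at(3) == "A",                           # (5) A at position 3
        at(10) == "T",                          # (6) T at position 10
        at(19) not in "GC",                     # (7) no G or C at position 19
        at(13) != "G",                          # (8) no G at position 13
    ]
    return sum(checks)
```

A candidate meeting all seven checked criteria returns 7, and a sequence consisting only of G returns 0.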
Ui-Tei et al. examined 72 siRNAs targeting six genes and reported four rules for effective siRNA designs (19). They are summarized as follows.
G2 (Ui-Tei et al., four rules): (1) A or T effective and G or C ineffective at position 19, (2) G or C effective and A or T ineffective at position 1, (3) at least 5 T or A residues from positions 13 to 19, and (4) no GC stretch more than 9 nt long.
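The four G2 rules translate directly into a yes/no test. The following Python sketch is a minimal illustration that assumes a 19-nt sense sequence in cDNA form with positions counted from the 5′ end; the function name is hypothetical.

```python
import re


def satisfies_ui_tei(sense: str) -> bool:
    """Return True if a 19-nt sense sequence meets all four G2 rules."""
    s = sense.upper().replace("U", "T")
    if len(s) != 19 or set(s) - set("ACGT"):
        raise ValueError("expected a 19-nt sequence over A, C, G, T/U")
    # Length of the longest uninterrupted run of G/C bases.
    longest_gc = max((len(run) for run in re.findall(r"[GC]+", s)), default=0)
    return (
        s[18] in "AT"                               # (1) A or T at position 19
        and s[0] in "GC"                            # (2) G or C at position 1
        and sum(b in "AT" for b in s[12:19]) >= 5   # (3) >= 5 A/T in 13-19
        and longest_gc <= 9                         # (4) no GC stretch > 9 nt
    )
```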
Amarzguioui and Prydz analyzed 46 siRNAs targeting four genes and reported the following six rules for effective siRNA designs based on the literature (20).
G3 (Amarzguioui and Prydz, six rules): (1) G or C positive and T negative at position 1, (2) A positive at position 6, (3) T negative at position 10, (4) T positive at position 13, (5) C positive at position 16, and (6) A or T positive and G negative at position 19.
Jagla et al. tested 601 siRNAs targeting one exogenous and three endogenous genes and reported four rules as follows (22).
G4 (Jagla et al.): (1) A or T positive at position 19, (2) A or T positive at position 10, (3) G or C positive at position 1, and (4) more than three As or Ts between positions 13 and 19.
Hsieh et al. examined 138 siRNAs targeting 22 genes and reported the following position-specific characteristics (21).
G5 (Hsieh et al.): (1) T positive and G negative at position 19, (2) C or G positive and A or T negative at position 11, (3) G positive at position 16, (4) A positive at position 13, and (5) C negative at position 6.
The previous works on positional characteristics in siRNA design described above are summarized in Table 1.1a.
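The positional preferences of guidelines G1 to G5 can also be written as a lookup table, which makes their inconsistencies easy to inspect for a given candidate. The Python sketch below encodes only the position-specific nucleotides listed above (not GC content, A/T counts, or repeat rules). Its scoring of +1 for each preferred and −1 for each unpreferred nucleotide is an assumption made for illustration and is not one of the scores proposed in this chapter.

```python
# Position -> nucleotides, per guideline (positions 1..19, sense strand, cDNA form).
GUIDELINES = {
    "G1": {"preferred": {3: "A", 10: "T", 13: "ACT", 19: "AT"},
           "unpreferred": {}},
    "G2": {"preferred": {1: "GC", 19: "AT"},
           "unpreferred": {1: "AT", 19: "GC"}},
    "G3": {"preferred": {1: "GC", 6: "A", 13: "T", 16: "C", 19: "AT"},
           "unpreferred": {1: "T", 10: "T", 19: "G"}},
    "G4": {"preferred": {1: "GC", 10: "AT", 19: "AT"},
           "unpreferred": {}},
    "G5": {"preferred": {11: "CG", 13: "A", 16: "G", 19: "T"},
           "unpreferred": {6: "C", 11: "AT", 19: "G"}},
}


def positional_score(sense: str, guideline: str) -> int:
    """Return (#preferred hits) - (#unpreferred hits) for one guideline.

    The +1/-1 weighting is a hypothetical choice for this sketch.
    """
    s = sense.upper().replace("U", "T")
    if len(s) != 19 or set(s) - set("ACGT"):
        raise ValueError("expected a 19-nt sequence over A, C, G, T/U")
    rules = GUIDELINES[guideline]
    plus = sum(s[pos - 1] in nts for pos, nts in rules["preferred"].items())
    minus = sum(s[pos - 1] in nts for pos, nts in rules["unpreferred"].items())
    return plus - minus
```

Scoring the same candidate under all five guidelines, for example with {g: positional_score(seq, g) for g in GUIDELINES}, shows how differently the guidelines can rank one sequence.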
Other scoring, screening, and designing methods for functional siRNAs have also been reported recently. Chalk et al. reported seven rules (the “Stockholm rules”) based on thermodynamic properties. They are (1) total hairpin energy < 1, (2) antisense 5′ end binding energy < 9, (3) sense 5′ end binding energy in range 5–9 (exclusive), (4) GC between 36 and 53%, (5) middle (7–12) binding energy < 13, (6) energy difference < 0, and (7) energy difference within −1 and 0. The score of an siRNA candidate is incremented by one for each rule fulfilled, giving a score range of [0, 7] (13).
Huesken et al. reported a method for screening functional siRNAs by using an artificial neural network (23). This network was first trained on 2182 randomly selected siRNAs targeting 34 genes and was then used in the design of a genome-wide siRNA collection with two potent siRNAs per gene.
Teramoto et al. and Ladunga reported functional siRNA selection methods using the support vector machine (SVM) (14, 34). Teramoto et al. used a generalized string kernel (GSK) combined with SVM. siRNA sequences were represented as vectors in a multidimensional feature space according to the numbers of subsequences in each siRNA and classified into effective or ineffective siRNAs (14). Ladunga used SVM with polynomial kernels and constrained optimization models from 572 sequence, thermodynamic, accessibility, and self-hairpin features over 2200
Table 1.1a Effective and ineffective nucleotides specified in the individual guidelines
Position          1    3    6    10   11   13   16   19
G1  Preferred     -    A    -    T    -    ACT  -    AT
G2  Preferred     GC   -    -    -    -    -    -    AT
    Unpreferred   AT   -    -    -    -    -    -    GC
G3  Preferred     GC   -    A    -    -    T    C    AT
    Unpreferred   T    -    -    T    -    -    -    G
G4  Preferred     GC   -    -    AT   -    -    -    AT
G5  Preferred     -    -    -    -    CG   A    G    T
    Unpreferred   -    -    C    -    AT   -    -    G
Position: nucleotide position from 1 to 19 (5′ to 3′, cDNA form). Preferred: effective (positive); unpreferred: ineffective (negative). A dash marks a position for which the guideline specifies no nucleotide.