
How does REL infer selection?

Complete method details can be found in our MBE paper
Phase 1: Nucleotide model maximum likelihood (ML) fit
A nucleotide model (any model from the time-reversible class can be chosen) is fitted to the data and tree (either neighbor-joining (NJ) or user-supplied) using maximum likelihood to obtain branch lengths and substitution rates. If the input alignment contains multiple segments, base frequencies and substitution rates are inferred jointly from the entire alignment, while branch lengths are fitted to each segment separately. The "best-fitting" model can be determined automatically by a model selection procedure or chosen by the user.
Phase 2: Codon model ML fit
Holding branch lengths proportional to, and substitution rate parameters constant at, the values estimated in Phase 1, a codon model obtained by crossing MG94 with the nucleotide model of Phase 1 is fitted to the data to obtain independent rate distributions for dN and dS. This method allows for rate heterogeneity in both synonymous and non-synonymous rates by fitting a 3-bin general discrete distribution to dS and another 3-bin general discrete distribution to dN, yielding 9 possible values for the ratio dN/dS.
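As a rough illustration of how two independent 3-bin distributions give 9 rate classes, the sketch below crosses a dS distribution with a dN distribution. The function name joint_rate_classes and all numeric values are made up for illustration; they are not HyPhy output, and the sketch assumes every dS rate is strictly positive so that dN/dS is defined.

```python
from itertools import product


def joint_rate_classes(ds_rates, ds_weights, dn_rates, dn_weights):
    """Cross two independent discrete rate distributions.

    Returns one entry per (dS, dN) pair. Because the two distributions
    are treated as independent, the weight of a pair is the product of
    the two bin weights. Assumes all dS rates are > 0.
    """
    classes = []
    for (ds, w_s), (dn, w_n) in product(
        zip(ds_rates, ds_weights), zip(dn_rates, dn_weights)
    ):
        classes.append(
            {"dS": ds, "dN": dn, "omega": dn / ds, "weight": w_s * w_n}
        )
    return classes


# Hypothetical 3-bin distributions (rates and weights are illustrative only).
example = joint_rate_classes(
    ds_rates=[0.5, 1.0, 2.0], ds_weights=[0.2, 0.5, 0.3],
    dn_rates=[0.1, 1.0, 3.0], dn_weights=[0.6, 0.3, 0.1],
)
for c in example:
    print(c)
```

With 3 bins on each side the result has 3 x 3 = 9 entries, and the weights sum to 1 whenever each input distribution does.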
Phase 3: Empirical Bayes analysis
For every site, using the parameter estimates from Phases 1 and 2, we compute two Bayes factors: one for the event {dN<dS} at that site (negative selection) and another for the event {dN>dS} (positive selection). When a Bayes factor is sufficiently large (say, 50 or more), we call the site selected. Note that Bayes factors cannot, in general, be easily related to statistical significance, although our simulation studies showed respectable power even for small datasets and reasonable false positive rates. As a rule of thumb, 1/(Bayes factor) is analogous to the p-value reported by the other two tests (SLAC and FEL) in this setting. REL tends to be less conservative and slower than those two methods.
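A minimal sketch of the per-site calculation follows, assuming the usual definition of a Bayes factor as posterior odds divided by prior odds. The function name bayes_factor, the tuple layout of the rate classes, and the per-class site likelihoods are hypothetical inputs chosen for illustration; in a real analysis the likelihoods would come from the fitted codon model, and this is not the HyPhy implementation.

```python
import math


def bayes_factor(classes, site_likelihoods, event):
    """Bayes factor for `event` at one site: posterior odds / prior odds.

    classes          -- list of (dS, dN, prior_weight) tuples
    site_likelihoods -- likelihood of the site under each class, same order
    event            -- predicate on (dS, dN), e.g. lambda ds, dn: dn > ds
    """
    prior = sum(w for ds, dn, w in classes if event(ds, dn))
    if not 0.0 < prior < 1.0:
        raise ValueError("prior probability of the event must be in (0, 1)")

    # Posterior weight of each class is proportional to prior * likelihood.
    joint = [w * lik for (_, _, w), lik in zip(classes, site_likelihoods)]
    total = sum(joint)
    posterior = sum(
        j for (ds, dn, _), j in zip(classes, joint) if event(ds, dn)
    ) / total

    if posterior >= 1.0:
        return math.inf
    return (posterior / (1.0 - posterior)) / (prior / (1.0 - prior))


# Illustrative two-class example (all numbers are made up).
toy_classes = [(1.0, 0.5, 0.5), (1.0, 2.0, 0.5)]
toy_likelihoods = [0.1, 0.9]
bf_positive = bayes_factor(toy_classes, toy_likelihoods, lambda ds, dn: dn > ds)
bf_negative = bayes_factor(toy_classes, toy_likelihoods, lambda ds, dn: dn < ds)
print(bf_positive, bf_negative)
```

Under the rule of thumb above, a site would be called positively selected when the first value is at least 50 (equivalently, when its reciprocal is at most 0.02).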
Note
This method is a generalization of the site-by-site positive selection analyses implemented in Ziheng Yang's PAML. The main differences are:
  1. More general nucleotide bias models
  2. Modeling of synonymous rate variation as well as non-synonymous rate variation
  3. Use of Bayes factors for empirical Bayes result processing (although the Bayes Empirical Bayes procedure in recent versions of PAML is better suited to smaller and 'noisier' datasets).
Refer to this paper for a detailed discussion.
UCSD Viral Evolution Group 2004-2024  