How does GA Branch infer selection?

Complete method details can be found in this MBE paper

Phase 1: Nucleotide model maximum likelihood (ML) fit

A nucleotide model (any model from the time-reversible class can be chosen) is fitted to the data and tree (either neighbor-joining or user-supplied) using maximum likelihood to obtain branch lengths and substitution rates. If the input alignment contains multiple segments, base frequencies and substitution rates are inferred jointly from the entire alignment, while branch lengths are fitted to each segment separately. The "best-fitting" model can be determined automatically by a model selection procedure or chosen by the user.

Phase 2: ML Codon model fitting

Holding branch lengths proportional to, and substitution rate parameters constant at, the values estimated in Phase 1, a codon model obtained by crossing MG94 with the nucleotide model of Phase 1 is fitted to the data to obtain a tree-wide estimate of ω.

Phase 3: Genetic algorithm search for branch allocation

Given B branch types (B = 2 initially, incremented after each GA iteration until no further improvement can be obtained), a genetic algorithm (CHC) is used to search for a good-fitting model among all those that allocate each tree branch to one of B rate classes (with a separate ω for every class). The fitness of each model is determined by its small-sample AIC score (c-AIC). Branch lengths are re-estimated once the last model for which branch lengths were estimated is 50 c-AIC points worse than the current best model.
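
As an illustration of the fitness score, the sketch below computes the standard small-sample AIC, c-AIC = -2 log L + 2pn/(n - p - 1), where log L is the maximized log-likelihood, p the number of estimated parameters and n the sample size. The function name, the choice of n (for example, the number of alignment sites) and the numbers used are assumptions made for this example; they are not taken from the GA Branch implementation.

```python
def c_aic(log_likelihood, n_params, sample_size):
    """Small-sample corrected AIC: -2 logL + 2 p n / (n - p - 1).

    sample_size is an assumption of this sketch (e.g. the number of
    alignment sites); lower scores indicate better-fitting models.
    """
    denom = sample_size - n_params - 1
    if denom <= 0:
        raise ValueError("sample_size must exceed n_params + 1")
    return -2.0 * log_likelihood + 2.0 * n_params * sample_size / denom


# Hypothetical comparison: a 2-class model versus a 3-class model
# that adds one ω parameter and improves the log-likelihood slightly.
two_class = c_aic(log_likelihood=-1000.0, n_params=10, sample_size=111)
three_class = c_aic(log_likelihood=-999.5, n_params=11, sample_size=111)
better = "two_class" if two_class < three_class else "three_class"
```

In this made-up case the extra rate class does not improve the likelihood enough to offset the penalty for the additional parameter, so the two-class model keeps the lower (better) score.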

Phase 4: Multimodel inference

A 95% confidence set of models is determined using their Akaike weights, and various quantities (e.g. ω for every branch, the probability that ω > 1) are computed using model averaging.
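
The multimodel step can be sketched as follows, assuming the usual definition of Akaike weights (each model's weight is proportional to exp(-Δ/2), where Δ is its c-AIC difference from the best model). The function names, the three c-AIC scores and the per-model ω values for a single branch are hypothetical; whether GA Branch averages over all examined models or only over the confidence set is not specified here, so the example averages over all listed models.

```python
import math


def akaike_weights(scores):
    """Convert c-AIC scores (lower is better) into normalized weights."""
    best = min(scores)
    rel = [math.exp(-0.5 * (s - best)) for s in scores]
    total = sum(rel)
    return [r / total for r in rel]


def confidence_set(weights, level=0.95):
    """Indices of the smallest set of top models whose weights sum to >= level."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += weights[i]
        if cumulative >= level:
            break
    return chosen


def model_average(weights, values):
    """Weighted average of a per-model quantity."""
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical c-AIC scores and ω estimates for one branch under three models.
scores = [100.0, 102.0, 110.0]
omega_on_branch = [0.5, 1.5, 3.0]

weights = akaike_weights(scores)
credible_models = confidence_set(weights)
mean_omega = model_average(weights, omega_on_branch)
# Model-averaged support for ω > 1 on this branch.
prob_omega_gt_1 = sum(w for w, o in zip(weights, omega_on_branch) if o > 1.0)
```

Here the support for ω > 1 on the branch is simply the total weight of the models that place the branch in a rate class with ω > 1.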

Note

Unlike branch-site methods, GA Branch does not require the user to select branches of interest to test, or to test one branch at a time (which can lead to statistical instability or acceptance of poorly supported models; see Section 1.5 for discussion), but rather mines the data for good-fitting models. In addition, inference based on multiple models (as opposed to a null-alternative pair) is more robust to model misspecification. On the other hand, the current version of GA Branch does not easily accommodate site-to-site ω variation (except uniformly along branches).

UCSD Viral Evolution Group 2004-2024