
How does Spidermonkey work?

UPDATE 2008-07-01
Our application note on Spidermonkey has been published online at Bioinformatics Advance Access.
This method was introduced in the evolutionary context in our paper on the evolution of the phenotypically important and highly variable V3 loop of the envelope glycoprotein in HIV-1.
Overview

Spidermonkey is an analytical pipeline (a series of data-processing algorithms) for identifying co-evolving sites in an alignment of protein-coding sequences. We provide an easy-to-use web interface to upload and screen alignments, infer a tree in cases where none is available, reconstruct the evolutionary history of the sequences, and analyze coevolutionary patterns in that history using Bayesian graphical models.

The first three stages of the Spidermonkey analysis of protein-coding nucleotide sequences (i.e. codon sequences) use the same pipeline as SLAC; for your convenience, the corresponding help text from the SLAC help page is reproduced below, along with additional material specific to Spidermonkey. We also enable you to detect co-evolving sites in non-coding nucleotide and amino acid sequences using methods similar to the SLAC pipeline (fitting the respective substitution model by maximum likelihood and reconstructing ancestral sequences).

Uploading your data

Because Spidermonkey utilizes the same data input methods as the rest of the analyses on Datamonkey, it accepts alignments in FASTA, PHYLIP, NEXUS, and MEGA formats. For further details, please refer to the help page on preparing your data. As usual, you will be given the option of using your own tree if it is included in the file along with the sequences; otherwise, you can choose to infer a tree de novo using the neighbor-joining method implemented in Datamonkey. Additionally, you can decide to use a specific model of nucleotide substitution (e.g. HKY85) or use the automated model selection algorithm in Datamonkey (based on likelihood).

[IMPORTANT] If your alignment contains more than one partition, you will not be allowed to choose Spidermonkey as an analytical option after the data upload and tree option pages. This is because Spidermonkey compares information about how non-synonymous substitutions are distributed across tree branches on a site-by-site basis. If two sites have different trees associated with them, it is generally impossible to match up inferred substitutions on the affected branches in a meaningful way.

Your alignment may be partitioned if you have previously analyzed it for recombination using GARD or other methods in HyPhy. Spidermonkey assumes that there has been no recombination in the alignment, so if you need to detect co-evolving sites in an alignment where recombination is present, extract each partition and upload it separately, running Spidermonkey on each partition individually. Note that in some cases it isn't strictly necessary to run separate partitions, namely when phylogenetic incongruence is due to variation in rates rather than recombination. To determine whether this is the case, inspect the 'best split' or raw output from GARD and compare the topologies of the split trees.

Phase 1: Nucleotide/protein model maximum likelihood (ML) fit

For codon or nucleotide sequences, a nucleotide model (any model from the time-reversible class can be chosen) is fitted to the data and tree (either NJ or user supplied) using maximum likelihood to obtain branch lengths and substitution rates. The "best-fitting" nucleotide model can be determined automatically by a model selection procedure or chosen by the user. An analysis of non-coding nucleotide sequences jumps directly to Phase 3.
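
If you're curious what the likelihood-based model selection boils down to, here is a minimal sketch of an AIC-style comparison in Python (the models, log-likelihoods, and parameter counts below are invented placeholders, and the exact criterion Datamonkey uses may differ; the actual procedure is implemented in HyPhy):

    # Sketch: choosing among time-reversible nucleotide models by AIC.
    # The (model, logL, k) triples are illustrative, not real fits.
    fits = [
        ("JC69",  -5124.3, 0),   # k = number of free rate parameters
        ("HKY85", -5067.8, 4),
        ("GTR",   -5061.2, 8),
    ]

    def aic(log_likelihood, k):
        """Akaike information criterion: 2k - 2*logL (smaller is better)."""
        return 2 * k - 2 * log_likelihood

    best = min(fits, key=lambda fit: aic(fit[1], fit[2]))
    print("best-fitting model:", best[0])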

For amino acid sequences, one of several empirically derived amino acid substitution rate matrices (e.g. PAM, or HIVwithin; Nickle et al. 2007) is fitted using maximum likelihood as above. We also provide an AIC-based model selection procedure to discriminate between several canonical models of protein evolution on the basis of the data. An analysis of amino acid sequences then jumps directly to Phase 3.

Phase 2: Codon model ML fit
Holding branch lengths and substitution rate parameters constant at the values estimated in Phase 1, a codon model, obtained by crossing MG94 with the nucleotide model from Phase 1, is fitted to the data to obtain a global ω = dN/dS ratio. Optionally, a confidence interval on dN/dS can be obtained using profile likelihood.
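
A sketch of the profile-likelihood idea behind the optional confidence interval (the log-likelihood curve here is a made-up stand-in; in the real analysis each point comes from refitting the codon model with ω fixed at a trial value):

    # Sketch: a 95% profile-likelihood interval for omega = dN/dS.
    import numpy as np

    def logL(omega):
        """Hypothetical profile log-likelihood curve, peaked at omega = 0.4."""
        return -3000.0 - 40.0 * (np.log(omega) - np.log(0.4)) ** 2

    omegas = np.linspace(0.05, 2.0, 400)
    curve = np.array([logL(w) for w in omegas])
    cutoff = curve.max() - 1.92            # chi^2_1(0.95) / 2 = 1.92
    inside = omegas[curve >= cutoff]       # omegas not rejected at 5%
    print("MLE of omega ~ %.2f" % omegas[curve.argmax()])
    print("95%% CI ~ (%.3f, %.3f)" % (inside.min(), inside.max()))
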
Phase 3: Reconstructing ancestors and counting substitutions

Utilizing parameter estimates from Phases 1 and/or 2, ancestral sequences are reconstructed site by site using maximum likelihood, in such a way as to maximize the likelihood of the data at the site over all possible ancestral character states. Inferred ancestral sequences are treated as known for Phase 4.

It is also possible to resample ancestral sequences, weighting random outcomes in proportion to their posterior probability given your alignment. Resampling ancestral sequences provides a means of quantifying our confidence in identifying co-evolving sites given our uncertainty in inferring ancestral sequences.
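
As a sketch, resampling the ancestral state at a single node and site might look like this (the posterior probabilities are invented for illustration):

    import random

    # Hypothetical posteriors for the ancestral residue at one internal
    # node and one site, given the alignment and the fitted model.
    posterior = {"A": 0.70, "S": 0.25, "T": 0.05}

    def sample_state(post):
        """Draw one ancestral state in proportion to its posterior."""
        states, weights = zip(*post.items())
        return random.choices(states, weights=weights, k=1)[0]

    # Each replicate re-draws the ancestor; the substitution map fed to
    # Phase 4 is then recomputed for each replicate.
    replicates = [sample_state(posterior) for _ in range(100)]
    print(replicates.count("A"), "of 100 replicates reconstruct 'A'")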

In the case of codon sequences, only nonsynonymous substitutions are mapped to branches of the tree; that is, a substitution is recorded on a branch when the codons at its two ends encode different amino acids. Codon sites at which no nonsynonymous substitutions have been inferred are automatically discarded from further analysis. Similarly, all invariant sites in nucleotide or protein sequences are discarded.
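
A sketch of the branch-by-branch test, assuming the ancestral codons are already in hand (the codon table is truncated to a few entries for illustration; a real table covers all 64 codons):

    # A substitution on a branch is nonsynonymous if the codons at its
    # two ends translate to different amino acids.
    CODON_TABLE = {"TTA": "L", "TTG": "L", "TCA": "S", "ATG": "M"}  # truncated

    def is_nonsynonymous(parent_codon, child_codon):
        """True if the branch's endpoint codons encode different residues."""
        if parent_codon == child_codon:
            return False                      # no substitution at all
        return CODON_TABLE[parent_codon] != CODON_TABLE[child_codon]

    print(is_nonsynonymous("TTA", "TTG"))   # False: synonymous (L -> L)
    print(is_nonsynonymous("TTA", "TCA"))   # True: nonsynonymous (L -> S)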

Intermission: Everything you ever wanted to know about Bayesian graphical models but were too socially well-adjusted to ask

A Bayesian graphical model (BGM) is a compact representation of the joint probability distribution of many random variables. In mathematics and computer science, a graph (or network) is a visual depiction of the relationship between two or more individuals, in which each individual is represented by a node (i.e. a point or vertex). Relationships between nodes are indicated by drawing a line connecting the nodes, which is referred to as an edge. In this context, edges represent significant statistical associations between individual codon sites of an alignment, which could be caused by functional or structural interactions between the corresponding residues of the protein.

Edges may be directed (arrows) to indicate that one node is conditionally dependent on another; otherwise, they are undirected and we make no distinction between whether A is dependent on B or vice versa. For example, one would be inclined to draw an arrow starting from a node for whether it is raining, and terminating at a node for whether the lawn is wet. (It's more intuitive to think about this in terms of the state of node A causing the state of node B, but with this we are treading near the morass of inferring causality from data; while many believe that this is possible, it remains an open philosophical problem and hence very bad juju for us simple-minded biologists.)
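
A toy version of the rain/wet-lawn graph, written out as conditional probability tables (all numbers invented), shows how a directed edge encodes conditional dependence and how the joint probability factorizes along the edges:

    # Toy BGM with one edge: Rain -> WetLawn. Probabilities are invented.
    p_rain = {True: 0.2, False: 0.8}
    p_wet_given_rain = {True:  {True: 0.90, False: 0.10},
                        False: {True: 0.15, False: 0.85}}

    def joint(rain, wet):
        """P(rain, wet) = P(rain) * P(wet | rain): the chain rule along edges."""
        return p_rain[rain] * p_wet_given_rain[rain][wet]

    print("P(rain & wet lawn) =", joint(True, True))
    total = sum(joint(r, w) for r in (True, False) for w in (True, False))
    print("joint sums to", total)   # sanity check: 1.0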

The power of BGMs lies in exploiting the Markov property (a.k.a. conditional independence) to isolate each node from the rest of the graph once its dependence on its neighbors is accounted for. For example, suppose we wanted to figure out the probability that my genome contains a particular genetic marker. Naively, I would have to assume that this probability was roughly equal to the expected frequency of that marker in the population. (Of course, we could be more sophisticated and stratify the population, using the frequency for my particular ethnic group, etc.) Now, if I learn that my paternal grandfather had X copies of the marker, then I can condition (at least in part) my probability on this observation. Moreover, if I subsequently learn my FATHER's genotype (this is becoming a well-characterized family!), then (all else being equal) it no longer matters what my grandfather's genotype was.
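
Here is the family example as a three-node chain in a short sketch (all probabilities invented); note that once the father's genotype is fixed, further conditioning on the grandfather changes nothing:

    # Chain: grandpa -> dad -> me, with an invented inheritance model:
    # P(child carries marker | parent carries) = 0.6, else 0.1.
    p_grandpa = {True: 0.3, False: 0.7}
    p_child = {True: 0.6, False: 0.1}

    def joint(g, d, m):
        pd = p_child[g] if d else 1 - p_child[g]
        pm = p_child[d] if m else 1 - p_child[d]
        return p_grandpa[g] * pd * pm

    # P(me | dad) marginalizes over grandpa...
    num = sum(joint(g, True, True) for g in (True, False))
    den = sum(joint(g, True, m) for g in (True, False) for m in (True, False))
    print("P(me | dad)          =", num / den)

    # ...and conditioning on grandpa as well gives the same answer:
    print("P(me | dad, grandpa) =", joint(True, True, True) /
          sum(joint(True, True, m) for m in (True, False)))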

If you still don't understand what we're talking about here, you're probably thinking about it too much. :-)

Another advantage of using BGMs is that we're looking at all the variables at the same time. This is a big improvement over applying, say, a pairwise association test statistic (e.g. chi-squared contingency tables) to every pair of variables and throwing all the significant pairs into a big bucket. Oftentimes this produces a big mess when you try to make out the big picture: it's like trying to read a street map that's been through an industrial blender.

To make matters worse, pairwise comparisons can pick up a large number of extraneous associations. In the family example above, my genotype depends on my grandfather's genotype only insofar as it influences my father's genotype. Pairwise comparisons may therefore report THREE significant associations (me-dad, me-grandpa, dad-grandpa) where there are actually only two (me-dad, dad-grandpa).

Being able to ignore the rest of the graph when we're computing the (posterior) probability of each node makes life a lot easier. After we're done, we can use the chain rule of probability to compute the joint posterior probability of the whole graph (nodes, edges, and all). Every graph is a hypothesis of how the variables work together. Our goal is to find the most likely graph given our data. However, this isn't always a feasible task. The number of possible graphs (especially if we're working with directed edges) grows super-exponentially with the number of nodes; for example, there are over a trillion different undirected graphs on just 10 nodes. Being biologists, we won't usually have enough data to single out one graph among trillions as the most likely.

This is where order-MCMC comes in. Instead of trying to pick just one graph, we're going to average over subsets of graphs, which are defined by permutations in a hierarchical ranking of the nodes in the graph (node orders). A permutation is one of many ways of ordering a set of distinct objects into a linear sequence. For example, the Three Stooges can walk into a door-frame in six permutations: Larry, Curly, Moe; Larry, Moe, Curly; and so on. With a node order, we are making an assertion about which nodes can depend on (be the 'child' of) other nodes. The trick is then to wander around the permutation space of node orders with a Markov chain Monte Carlo (MCMC) sampler, comparing the posterior probability of each subset of graphs defined by the current hierarchy.
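
A bare-bones sketch of the order-MCMC move (the scoring function is a toy stand-in; the real quantity is the log marginal posterior of all graphs consistent with a given node order, computed from the data):

    import math, random

    def score(order):
        """Toy stand-in for the log posterior of a node order."""
        return -sum(abs(i - node) for i, node in enumerate(order))

    order = list(range(10))                  # current node order
    random.shuffle(order)
    for step in range(10000):
        i, j = random.sample(range(len(order)), 2)
        proposal = order[:]
        proposal[i], proposal[j] = proposal[j], proposal[i]   # swap two nodes
        delta = score(proposal) - score(order)
        # Metropolis rule: always accept uphill, sometimes accept downhill.
        if delta >= 0 or random.random() < math.exp(delta):
            order = proposal
    print("final order:", order)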

The nice thing about this approach is that it's more efficient (there are far fewer node orders on N nodes than there are undirected graphs: N! versus 2^(N(N-1)/2), to be exact), and it characterizes what the 'correct' graph looks like when we have limited data.
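
The arithmetic for N = 10 makes the gap concrete:

    import math

    N = 10
    orders = math.factorial(N)             # node orders: N!
    graphs = 2 ** (N * (N - 1) // 2)       # undirected graphs on N nodes
    print(f"{orders:,} orders vs {graphs:,} graphs")
    # 3,628,800 orders vs 35,184,372,088,832 graphs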

Further reading in Bayesian graphical models:
The standard text is Pearl (1988). However, I personally find it a little esoteric and not well suited to biologists with mathematical aspirations. I find the following papers to provide excellent background in lucid prose:

Phase 4: Bayesian graphical model (BGM) analysis setup

Spidermonkey requires you to specify two analytical options. The first is whether to allow a maximum of one or two parents per node. If you specify a maximum of one parent per node (i.e. a node can be conditionally dependent on no more than one other node in the graph), then the inference of structure is based on undirected edges; otherwise, inference is based on directed edges. Restricting the number of parents to one greatly reduces the computational complexity of the problem (roughly proportional to the number of nodes in the network). Consequently, the limit on the number of sites you can analyze is more generous (1000-node maximum) under the one-parent scenario.

On the other hand, if you really care about trying to infer causality (a path filled with many hazards and bogeymen), or you anticipate that some sites may depend on more than one other site, then it is more appropriate to run a two-parent network analysis. Because this consumes more computing time (growing roughly cubically with the number of nodes), we restrict you to a 75-node maximum. (If you really want to run more, you can always download HyPhy and do the analysis yourself. Nyah!)

What to do? Given that you'll probably need to select a smaller number of codon sites from your alignment, Spidermonkey provides a site-selection interface to filter sites based on one of several site-specific variation statistics:

  • Count of branches with NS subs --- filter sites based on a minimum number of inferred nonsynonymous substitutions in the tree.
  • Proportion of branches with NS subs --- filter sites based on a minimum fraction of branches in the tree with nonsynonymous substitutions at each site.
  • Entropy --- filter sites based on an entropy measure of site-by-site variation (a sketch follows below).
For your convenience, all three of these statistics are listed in a table (excluding sites at which no nonsynonymous substitutions were inferred). You can also manually select an arbitrary set of sites using checkboxes in the table, or select all or none of the sites in the table. This encapsulates the second option for this analysis (we did say there were two).
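
For the curious, here is a sketch of one way an entropy statistic for an alignment column can be computed (not necessarily Spidermonkey's exact formula, which may differ in logarithm base or normalization):

    import math
    from collections import Counter

    def column_entropy(column):
        """Shannon entropy (nats) of one alignment column: 0 for an
        invariant site, larger for more variable sites."""
        counts = Counter(column)
        n = sum(counts.values())
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    print(column_entropy("AAAAAAAA"))   # 0.0: invariant
    print(column_entropy("AASSTTLL"))   # ~1.39: highly variable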

If numbers aren't informative enough, you can inspect the distribution of substitution events in the tree by selecting one of the Map options in the table:

  • [Nucleotide/Codon/AA] --- Display substitutions mapped to branches in the tree as a PNG image, exportable as a PDF.
  • [Counts] --- Display a color-coded table of individual substitutions mapped to named branches.

When you're done selecting a subset of sites, punch the Run BGM/Spidermonkey button to start the analysis! Your browser will be directed to a page that displays a progress report of your BGM analysis, refreshing automatically every 60 seconds. Most BGM analyses submitted to our cluster currently take about 10 minutes to complete. If you can't wait, you can always bookmark the page and return to it with your web browser at a later time (for up to 96 hours).
Phase 5: Spidermonkey/BGM Analysis Results

Once the BGM analysis has completed, your browser will be automatically redirected to the default results page. The first set of results consists of the set of putatively co-evolving sites in the alignment under the default posterior probability cut-off (0.5). You can always retabulate this list using your own cut-off value, which may be necessary if none of the edge posterior probabilities in the BGM exceed the default level.
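
Retabulating with your own cut-off amounts to filtering the edge list by marginal posterior probability. A sketch using the CSV report (the filename and column names are assumed for illustration and may not match the actual file):

    import csv

    CUTOFF = 0.9   # your own posterior probability threshold
    with open("spidermonkey_edges.csv") as handle:   # hypothetical filename
        for row in csv.DictReader(handle):
            # Column names "site1", "site2", "posterior" are assumptions.
            if float(row["posterior"]) >= CUTOFF:
                print(row["site1"], "<->", row["site2"], row["posterior"])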

The second item on the front results page is a scatterplot of the Markov chain Monte Carlo (MCMC) trace, with the sample number on the x-axis and the log posterior probability of the model (in this case, the node ordering) on the y-axis. The usual diagnostics for MCMC analyses apply here: you want to see a near-zero slope, consistent with a sufficient burn-in period, and a lack of autocorrelation in the trace, indicating sufficient thinning of the chain sample. Any textbook or review article on MCMC should feature some discussion of convergence diagnostics, and will usually point out that said diagnostics tend to be lousy. Fortunately, the order-MCMC algorithm implemented in Spidermonkey is supposed to have fairly good convergence properties. Although we can NEVER be sure that an MCMC process has converged, I have yet to see a job on Spidermonkey egregiously fail to converge (the only unambiguous scenario).
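
If you prefer numbers to eyeballing, both checks can be run on the trace values directly; a sketch, assuming you have saved the sampled log posteriors to a text file with one value per line:

    import numpy as np

    trace = np.loadtxt("mcmc_trace.txt")   # hypothetical file of log posteriors

    # Near-zero slope after burn-in suggests the chain has leveled off.
    slope = np.polyfit(np.arange(trace.size), trace, 1)[0]
    print("slope per sample:", slope)

    # Lag-1 autocorrelation near zero suggests adequate thinning.
    r = np.corrcoef(trace[:-1], trace[1:])[0, 1]
    print("lag-1 autocorrelation:", r)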

At the top of the front results page, you will find three links to detailed reports:

  • HTML --- Displays the entire list of marginal posterior probabilities for all edges in the graph as an HTML-rendered table.
  • CSV --- Displays the same list but in plain ol' ASCII in a comma-separated (CSV) format.
  • Plots --- Displays options for generating a visualization of the BGM, which can be exported in PNG (portable network graphics), PostScript, and PDF (portable document format) formats. Intended to help you generate figures depicting Spidermonkey results.
