HIV-1 Ultradeep Sequence Analysis Pipeline

Manuscript currently in preparation

Phase 0: Quality Filtering

Alignments are filtered by removing sequences of low quality (as determined by their PHRED scores). The current defaults are set to only include reads of minimum length 100 bp and PHRED scores > 20. A PHRED score of 10 corresponds to approximately 90% base-call accuracy, whereas a score of 20 corresponds to approximately 99%. The HIV-1 454 analysis pipeline is also available for download.
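
As an illustration only, the filter could be sketched as below. A PHRED score Q maps to a base-call error probability of 10^(-Q/10). The function names are hypothetical, and the sketch assumes that the quality criterion is applied to the mean PHRED score of a read; the pipeline may apply it differently (for example per base).

```python
def phred_to_error_prob(q):
    """Convert a PHRED score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def passes_filter(seq, quals, min_length=100, min_quality=20):
    """Return True if a read meets the length and quality defaults.

    ASSUMPTION: quality is summarised as the mean PHRED score of the read.
    """
    if len(seq) < min_length or not quals:
        return False
    return sum(quals) / len(quals) > min_quality


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, phred_to_error_prob(q))
```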

Phase 1: Amino acid and nucleotide alignment

The alignment phase first performs amino acid alignment between a chosen reference sequence and each of the reads. Only alignments that exceed an alignment score threshold are retained, where the threshold is 5 times the alignment score expected from a read of equal length and identical base composition. The next alignment step tries to rescue reads that failed the amino acid alignment by performing pairwise nucleotide alignments to the consensus of the reads that passed the amino acid alignment. A read is included in this second step if its pairwise per-nucleotide alignment score exceeds the median per-nucleotide score of all reads included in the amino acid alignment step.
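
The two inclusion rules can be sketched as follows, assuming that all alignment scores have already been computed elsewhere. The function name, the dictionary inputs and the way the expected score is supplied are hypothetical; only the factor of 5 and the median rule come from the description above.

```python
from statistics import median


def select_reads(aa_scores, expected_scores, nuc_scores_per_base, factor=5.0):
    """Split reads into those passing step 1 and those rescued in step 2.

    aa_scores:           read id -> amino acid alignment score
    expected_scores:     read id -> score expected for a read of equal
                         length and base composition (computed elsewhere)
    nuc_scores_per_base: read id -> per-nucleotide score against the consensus
    """
    # Step 1: keep reads whose score exceeds factor * expected score.
    passed = {r for r, s in aa_scores.items() if s > factor * expected_scores[r]}
    if not passed:
        return passed, set()

    # Step 2: rescue failed reads that beat the median per-nucleotide
    # score of the reads retained in step 1.
    cutoff = median(nuc_scores_per_base[r] for r in passed)
    rescued = {
        r for r in aa_scores
        if r not in passed and nuc_scores_per_base[r] > cutoff
    }
    return passed, rescued
```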

Phase 2: Estimation of summary statistics

This phase reports summary statistics on read length, depth and frequencies of minority variants.

Phase 3: Diversity Analysis

The sliding window analysis phase estimates nucleotide diversity in sliding windows which meet the minimum coverage criteria. Phylogenies are also estimated within sliding windows, and bootstrap resampling is applied to the window that has at least 4 variants and the maximum nucleotide diversity. The latter is useful for detecting dual or multiple infection, although the power to recover well-supported trees is reduced since reads are typically short (<200 bp).
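
A minimal sketch of the windowed diversity calculation is given below. It assumes reads are supplied as equal-length aligned strings in which any character other than A, C, G or T marks an uncovered position, and it uses the standard estimator of nucleotide diversity (the mean proportion of pairwise differences per site). The function names and the default window, step and coverage values are hypothetical and are not the pipeline's settings.

```python
from collections import Counter


def site_diversity(column):
    """Return (coverage, diversity) for one alignment column."""
    bases = [b for b in column if b in "ACGT"]
    n = len(bases)
    if n < 2:
        return n, 0.0
    counts = Counter(bases)
    homozygosity = sum((c / n) ** 2 for c in counts.values())
    # Unbiased estimate of the probability that two reads differ here.
    return n, n / (n - 1) * (1.0 - homozygosity)


def sliding_window_diversity(reads, window=100, step=10, min_coverage=100):
    """Mean per-site diversity for every window meeting min_coverage."""
    length = len(reads[0])
    sites = [site_diversity([r[i] for r in reads]) for i in range(length)]
    results = []
    for start in range(0, length - window + 1, step):
        chunk = sites[start:start + window]
        # Skip windows containing any site below the coverage minimum.
        if min(n for n, _ in chunk) >= min_coverage:
            results.append((start, sum(pi for _, pi in chunk) / window))
    return results
```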

Phase 4: Mutation rate estimation

The number of mutation rate classes is estimated using a binomial mixture model. Briefly, we fit a model with a single rate class and estimate the mutation rate from a binomial distribution with the number of successes equal to the number of observed mutations at a site, and the number of trials equal to the observed coverage at a site. Additional rate classes are added using a mixture of binomial models until model fit (evaluated using AIC) is no longer improved. The parameters of the binomial mixture model (i.e. rates and their respective proportions) are estimated using maximum likelihood.
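
The procedure can be illustrated with the sketch below. It is not the pipeline's implementation: it assumes an expectation-maximisation (EM) search for the maximum likelihood estimates with a simple quantile-based starting point, and all function names are hypothetical. Every site is assumed to have coverage greater than zero.

```python
import math


def log_binom_pmf(k, n, p):
    """Log probability of k mutations among n reads at rate p."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)


def site_log_terms(k, n, weights, rates):
    """Per-class log(weight * likelihood) for one site, and their log-sum."""
    logs = [math.log(w) + log_binom_pmf(k, n, r) for w, r in zip(weights, rates)]
    top = max(logs)
    return logs, top + math.log(sum(math.exp(x - top) for x in logs))


def clamp(x, lo=1e-9, hi=1.0 - 1e-9):
    """Keep probabilities away from 0 and 1 so logarithms stay finite."""
    return min(max(x, lo), hi)


def fit_mixture(muts, cov, n_classes, n_iter=200):
    """Fit a binomial mixture by EM; returns (weights, rates, log-likelihood)."""
    observed = sorted(m / c for m, c in zip(muts, cov))
    # ASSUMPTION: start each class at a quantile of the observed rates.
    rates = [clamp(observed[int((j + 0.5) * len(observed) / n_classes)])
             for j in range(n_classes)]
    weights = [1.0 / n_classes] * n_classes
    for _ in range(n_iter):
        # E-step: posterior probability of each class at each site.
        resp = []
        for k, n in zip(muts, cov):
            logs, total = site_log_terms(k, n, weights, rates)
            resp.append([math.exp(x - total) for x in logs])
        # M-step: update class proportions and rates.
        for j in range(n_classes):
            col = [r[j] for r in resp]
            weights[j] = clamp(sum(col) / len(col))
            num = sum(r * k for r, k in zip(col, muts))
            den = max(sum(r * n for r, n in zip(col, cov)), 1e-300)
            rates[j] = clamp(num / den)
    loglik = sum(site_log_terms(k, n, weights, rates)[1] for k, n in zip(muts, cov))
    return weights, rates, loglik


def select_rate_classes(muts, cov, max_classes=5):
    """Add classes until AIC stops improving; returns (aic, k, weights, rates)."""
    best = None
    for k in range(1, max_classes + 1):
        weights, rates, loglik = fit_mixture(muts, cov, k)
        aic = 2 * (2 * k - 1) - 2 * loglik  # k rates plus k - 1 free proportions
        if best is not None and aic >= best[0]:
            break
        best = (aic, k, list(weights), list(rates))
    return best
```

The per-site posterior probabilities computed in the E-step are the same quantities that can be used to assign a site to a rate class.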

Phase 5: Selection analysis

Selection at sites is evaluated using all pairwise comparisons between reads. We estimate the ratio of observed non-synonymous to synonymous substitutions (weighted by the number of pairwise comparisons) and compare this to that expected given the observed codon frequencies and the genetic code.
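
One possible reading of this comparison is sketched below for a single codon site. The sketch makes several assumptions that are not stated above: only pairs of reads differing at exactly one nucleotide are counted, the expectation is taken over all single-nucleotide changes away from each observed codon (weighted by codon frequency, ignoring changes to stop codons), and the standard genetic code is used. Input codons must contain only A, C, G and T. All names are hypothetical.

```python
from itertools import combinations, product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def n_diffs(a, b):
    """Number of nucleotide differences between two codons."""
    return sum(x != y for x, y in zip(a, b))


def observed_counts(codons):
    """(nonsynonymous, synonymous) counts over all pairwise comparisons.

    ASSUMPTION: only pairs differing at exactly one nucleotide are counted.
    """
    nonsyn = syn = 0
    for a, b in combinations(codons, 2):
        if n_diffs(a, b) != 1:
            continue
        if CODE[a] == CODE[b]:
            syn += 1
        else:
            nonsyn += 1
    return nonsyn, syn


def expected_counts(codons):
    """Mean (nonsynonymous, synonymous) single-nucleotide changes per codon.

    Iterating over the reads weights each codon by its observed frequency.
    """
    nonsyn = syn = 0.0
    for codon in codons:
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODE[mutant] == "*":
                    continue  # ASSUMPTION: changes to stop codons are ignored
                if CODE[mutant] == CODE[codon]:
                    syn += 1
                else:
                    nonsyn += 1
    return nonsyn / len(codons), syn / len(codons)
```

Comparing the observed ratio of nonsynonymous to synonymous counts with the expected ratio then indicates whether nonsynonymous change is over- or under-represented at the site.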

Phase 6: Drug resistant mutation analysis

For each drug resistant site we estimate the mutation rank (i.e. the rank of the mutation rate with respect to all other sites) and calculate the median mutation rank of all drug resistant sites. The probability (P) that the median mutation rank at drug resistant sites is greater than an equivalent-sized sample of non-drug resistant sites is evaluated with permutations (n=1000). These data can be used to determine if mutation properties at drug resistant sites are unique. Furthermore, we classify drug resistant sites into mutation rate classes using the same methods described in the mutation rate class estimation procedure. Here we evaluate the posterior probability that a drug resistant site falls within a particular mutation rate class.
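
The permutation step can be sketched as follows. The sketch assumes that sites are ranked in ascending order of mutation rate (rank 1 is the lowest rate), that ties are broken by site order, and that P is the fraction of random samples whose median rank is below that of the drug resistant sites. The function names and the seed parameter are hypothetical.

```python
import random
from statistics import median


def rank_sites(rates):
    """Rank sites by mutation rate, 1 = lowest (ties broken by site order)."""
    order = sorted(range(len(rates)), key=lambda i: rates[i])
    ranks = [0] * len(rates)
    for rank, site in enumerate(order, start=1):
        ranks[site] = rank
    return ranks


def median_rank_test(rates, dr_sites, n_perm=1000, seed=0):
    """Return (median rank of drug resistant sites, P).

    P is the proportion of equally sized random samples of other sites
    whose median rank is lower than that of the drug resistant sites.
    """
    ranks = rank_sites(rates)
    dr = set(dr_sites)
    dr_median = median(ranks[i] for i in dr)
    other = [ranks[i] for i in range(len(rates)) if i not in dr]
    rng = random.Random(seed)
    hits = sum(
        dr_median > median(rng.sample(other, len(dr))) for _ in range(n_perm)
    )
    return dr_median, hits / n_perm
```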

Phase 7: Identification of drug resistant compensatory mutations

This analysis phase screens reads for the occurrence of both drug resistant and compensatory mutation sites. A Fisher's exact test is performed to determine whether drug resistant mutations and compensatory mutations co-occur on the same reads more frequently than expected by chance.
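
As a sketch of this test, each read covering both sites can be tallied into a 2 x 2 table and the one-sided probability computed from the hypergeometric distribution. The input format (a pair of booleans per read) and the function names are hypothetical, and the choice of a one-sided test is an assumption.

```python
from math import comb


def contingency(reads):
    """Tally reads into a 2 x 2 table.

    reads: iterable of (has_resistance_mutation, has_compensatory_mutation)
    Rows index the resistance mutation, columns the compensatory mutation.
    """
    table = [[0, 0], [0, 0]]
    for has_dr, has_comp in reads:
        table[0 if has_dr else 1][0 if has_comp else 1] += 1
    return table


def fisher_exact_greater(table):
    """One-sided Fisher's exact test for excess co-occurrence."""
    (a, b), (c, d) = table
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    upper = min(row1, col1)
    # Sum the probabilities of tables at least as extreme as the observed one.
    return sum(
        comb(row1, x) * comb(row2, col1 - x) for x in range(a, upper + 1)
    ) / total
```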

Results

All results are presented online. Result databases are also available for each gene processed for subsequent analysis and processing. We are in the process of writing dedicated HyPhy scripts for these purposes which will be made available here.

UCSD Viral Evolution Group 2004-2024