Supporting Material for "ASTRID: Accurate Species TRees from Internode Distances"
Datasets
The datasets analyzed are available at the following locations:
- Avian and Mammalian Simulated: https://www.ideals.illinois.edu/handle/2142/55319/browse?type=title
- Mammalian 250bp: http://www.cs.utexas.edu/~phylo/datasets/weighted-binning-datasets.html
- 10-taxon: http://www.cs.utexas.edu/~phylo/datasets/weighted-binning-datasets.html
- 15-taxon: http://www.cs.utexas.edu/~phylo/datasets/weighted-binning-datasets.html
- ASTRAL-2: http://www.cs.utexas.edu/~phylo/datasets/astral2/
- FastME 1 (the version used in the paper): http://www.ncbi.nlm.nih.gov/CBBresearch/Desper/FastME.html
- PhyD*: http://www.atgc-montpellier.fr/phyd/binaries.php
Scripts and tools
The script used to randomly delete taxa from a FASTA file is delete_taxa.py, which can be found on GitHub. To use it, run python delete_taxa.py fasta_file maxlen n1 n2 n3 ..., where n1, n2, n3, ... is a space-separated list of taxon numbers to delete and maxlen is the maximum output sequence length (longer sequences are truncated).
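The following is a minimal sketch of the kind of operation described above, not the actual delete_taxa.py. It assumes that taxon numbers are 1-based positions in the file and that a "deleted" taxon is kept as a row of dashes (which the FastTree note in this section suggests but does not state outright). The function names parse_fasta, mask_taxa, and format_fasta are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def mask_taxa(records, maxlen, taxon_numbers):
    """Truncate every sequence to maxlen and blank out the listed taxa.

    Assumptions (not confirmed by the source): taxon numbers are 1-based
    positions in the file, and a deleted taxon becomes a row of dashes.
    """
    drop = set(taxon_numbers)
    out = []
    for position, (name, seq) in enumerate(records, start=1):
        seq = seq[:maxlen]
        if position in drop:
            seq = "-" * len(seq)
        out.append((name, seq))
    return out


def format_fasta(records):
    """Serialize (name, sequence) pairs back to FASTA text, one line per sequence."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

With these helpers, format_fasta(mask_taxa(parse_fasta(text), maxlen, [n1, n2, n3])) would mirror the command-line call shown above, under the stated assumptions.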
For FastTree to work properly on datasets with missing taxa, the missing taxa must actually be deleted from the file; it is not sufficient to replace their characters with dashes. (To run a concatenation analysis, however, the missing taxa must be present, with dashes in place of characters.)
To fully delete the missing taxa, get filter_seqs.py, which is also in the GitHub repository above, and run python filter_seqs.py fasta_file > filtered_fasta_file. FastTree can then be run on the filtered file with the command fasttree -nt -gtr -quiet -nopr -gamma -n 1000 filtered_fasta_file > genetreefile
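To illustrate the filtering step, here is a hedged sketch of removing records whose sequence consists only of gap characters. It is not the actual filter_seqs.py; the function names split_records and drop_all_gap_records are hypothetical, and the sketch assumes that missing taxa appear as all-dash sequences and that headers do not contain a ">" character.

```python
def split_records(fasta_text):
    """Split FASTA text into (header, sequence) pairs.

    Assumes ">" appears only at the start of header lines.
    """
    records = []
    for block in fasta_text.split(">")[1:]:
        lines = block.splitlines()
        if not lines:
            continue
        header = lines[0].strip()
        seq = "".join(part.strip() for part in lines[1:])
        records.append((header, seq))
    return records


def drop_all_gap_records(fasta_text, gap_chars="-"):
    """Return FASTA text without the records that contain only gap characters."""
    kept = [
        (header, seq)
        for header, seq in split_records(fasta_text)
        # strip() leaves an empty string when the sequence is all gaps (or empty).
        if seq.strip(gap_chars)
    ]
    return "".join(f">{header}\n{seq}\n" for header, seq in kept)
```

Sequences that contain some gaps but at least one real character are kept unchanged; only fully missing taxa are removed, which matches the requirement stated above for FastTree input.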
To measure topological error, we used the script developed by Siavash Mirarab, available with instructions for use at https://github.com/redavids/phylogenetics-tools/tree/master/comparetrees