Supporting Material for "ASTRID: Accurate Species TRees from Internode Distances"

Datasets

The datasets analyzed are available at the following locations:

In addition, the ASTRAL-2 50-taxon data with missing taxa is available as alignments and as gene trees estimated with FastTree. There are 50 replicates, and each file is named [estimatedgenetre|all-genes.phylip][lxxx][myy], where xxx is the sequence length (either 150bp or 300bp) and yy is the number of missing taxa (up to 40). 10-taxon data with missing taxa is also available as alignments and as gene trees estimated with FastTree, although it was not analyzed for the ASTRID paper.

Species trees for the avian and mammalian datasets are available here.

Distance-based methods

Scripts and tools

The script used to randomly delete taxa from a FASTA file is delete_taxa.py, which can be found on GitHub. To use it, run python delete_taxa.py fasta_file maxlen n1 n2 n3 ... where n1, n2, n3, ... is a space-separated list of the taxon numbers to delete and maxlen is a limit on the output sequence length (longer sequences are truncated).
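The following is a minimal sketch of this kind of taxon deletion, not the delete_taxa.py script itself. It assumes that taxon numbers are 0-based positions in file order and that "deleting" a taxon means replacing its characters with dashes (an inference from the note on FastTree in this section); the real script may differ on both points. The names read_fasta, pick_taxa, and mask_taxa are hypothetical.

```python
import random


def read_fasta(text):
    """Parse FASTA text into an ordered list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def pick_taxa(n_taxa, n_missing, seed=None):
    """Choose n_missing distinct taxon numbers at random (0-based: an assumption)."""
    return random.Random(seed).sample(range(n_taxa), n_missing)


def mask_taxa(records, maxlen, taxon_numbers):
    """Truncate every sequence to maxlen and replace the chosen taxa with dashes."""
    chosen = set(taxon_numbers)
    masked = []
    for index, (name, seq) in enumerate(records):
        seq = seq[:maxlen]
        if index in chosen:
            seq = "-" * len(seq)
        masked.append((name, seq))
    return masked
```

For example, mask_taxa(read_fasta(text), 150, pick_taxa(50, 10, seed=1)) would blank out 10 of 50 taxa and truncate every sequence to 150 characters.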

For FastTree to work properly on datasets with missing taxa, the missing taxa must actually be deleted from the file; it is not sufficient to replace their characters with dashes. To run a concatenation analysis, however, the missing taxa must be present, with dashes in place of characters.
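The concatenation case can be illustrated with a short sketch. It assumes each gene alignment is held in memory as a dictionary mapping taxon name to aligned sequence; the function name concatenate and this data layout are hypothetical, not part of the tools described here.

```python
def concatenate(alignments, taxa):
    """Concatenate per-gene alignments, padding absent taxa with dashes.

    alignments: list of dicts mapping taxon name -> aligned sequence.
    taxa: the full list of taxon names expected in the supermatrix.
    """
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("each gene alignment needs sequences of one equal length")
        (length,) = lengths
        for taxon in taxa:
            # A taxon missing from this gene contributes an all-dash block.
            parts[taxon].append(aln.get(taxon, "-" * length))
    return {taxon: "".join(blocks) for taxon, blocks in parts.items()}
```

Here a taxon that was removed from a gene file still occupies its columns in the concatenated matrix, which is the layout a concatenation analysis expects.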

To fully delete the missing taxa, get filter_seqs.py, which is also in the GitHub repository mentioned above, and run python filter_seqs.py fasta_file > filtered_fasta_file. FastTree can then be run on the filtered file with the command fasttree -nt -gtr -quiet -nopr -gamma -n 1000 filtered_fasta_file > genetreefile
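As a rough sketch of these two steps, the code below drops records whose sequences contain only dashes and then invokes FastTree with the flags quoted above. It is not filter_seqs.py: the assumption that filtering means removing all-dash records, the in-memory (name, sequence) representation, and the function names are all hypothetical. The executable name may also differ between installations (for example FastTree rather than fasttree).

```python
import subprocess


def drop_all_gap_records(records, gap_char="-"):
    """Keep only (name, sequence) pairs with at least one non-gap character."""
    return [(name, seq) for name, seq in records if seq.strip(gap_char)]


def to_fasta(records):
    """Serialize (name, sequence) pairs as FASTA text, one sequence per line."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


def run_fasttree(filtered_fasta_path, tree_path, executable="fasttree"):
    """Run FastTree with the flags from the text, writing the tree to tree_path."""
    cmd = [
        executable, "-nt", "-gtr", "-quiet", "-nopr",
        "-gamma", "-n", "1000", filtered_fasta_path,
    ]
    with open(tree_path, "w") as out:
        # check=True raises CalledProcessError if FastTree exits with an error.
        subprocess.run(cmd, stdout=out, check=True)
```

A typical use would write to_fasta(drop_all_gap_records(records)) to a file and pass that file's path to run_fasttree.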

To measure topological error, we used the script developed by Siavash Mirarab, available with instructions for use at https://github.com/redavids/phylogenetics-tools/tree/master/comparetrees
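For readers who want to see what a topological comparison involves, here is a small self-contained sketch that compares the non-trivial bipartitions of two Newick trees on the same leaf set. It is not the script referenced above, and whether that script reports exactly these quantities is an assumption. The simplified parser handles plain Newick with optional branch lengths and support values, but not quoted labels or comments; all function names are hypothetical.

```python
import re


def leaf_sets(newick):
    """Return (all leaf names, list of leaf sets below each internal node)."""
    tokens = [t.strip() for t in re.findall(r"[(),;]|[^(),;]+", newick)]
    tokens = [t for t in tokens if t]
    stack, clades, leaves = [], [], set()
    prev = None
    for tok in tokens:
        if tok == "(":
            stack.append(set())
        elif tok == ")":
            done = stack.pop()
            clades.append(frozenset(done))
            if stack:
                stack[-1] |= done
        elif tok in (",", ";"):
            pass
        elif prev != ")":
            # A label right after "(" or "," is a leaf; strip its branch length.
            if not stack:
                raise ValueError("leaf found outside parentheses")
            name = tok.split(":")[0].strip()
            leaves.add(name)
            stack[-1].add(name)
        # A label right after ")" is a support value or branch length: ignored.
        prev = tok
    return frozenset(leaves), clades


def bipartitions(newick):
    """Return (leaves, set of non-trivial splits), each split stored as one side."""
    leaves, clades = leaf_sets(newick)
    anchor = min(leaves)
    splits = set()
    for clade in clades:
        # Always store the side that does not contain the anchor leaf.
        side = leaves - clade if anchor in clade else clade
        if 2 <= len(side) <= len(leaves) - 2:
            splits.add(side)
    return leaves, splits


def compare(true_newick, est_newick):
    """Return (missing-branch rate, false-positive rate) of est against true."""
    true_leaves, true_splits = bipartitions(true_newick)
    est_leaves, est_splits = bipartitions(est_newick)
    if true_leaves != est_leaves:
        raise ValueError("trees must be on the same leaf set")
    fn = len(true_splits - est_splits) / len(true_splits) if true_splits else 0.0
    fp = len(est_splits - true_splits) / len(est_splits) if est_splits else 0.0
    return fn, fp
```

The missing-branch rate is the fraction of internal branches of the true tree that are absent from the estimated tree; the false-positive rate is the fraction of estimated branches that are absent from the true tree.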