Fast and accurate phylogeny reconstruction using filtered spaced-word matches

2017 | journal article. A publication with affiliation to the University of Göttingen.

Jump to: Cite & Linked | Documents & Media | Details | Version history

Cite this publication

​Fast and accurate phylogeny reconstruction using filtered spaced-word matches​
Leimeister, C.-A.; Sohrabi-Jahromi, S. & Morgenstern, B. ​ (2017) 
Bioinformatics33(7) pp. 971​-979​.​ DOI: https://doi.org/10.1093/bioinformatics/btw776 

Documents & Media

btw776.pdf930.41 kBAdobe PDF

License

Published Version

Attribution-NonCommercial 4.0 CC BY-NC 4.0

Details

Authors
Leimeister, Chris-Andre; Sohrabi-Jahromi, Salma; Morgenstern, Burkhard 
Abstract
Motivation:Word-based or 'alignment-free' algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods. Results: We propose Filtered Spaced Word Matches ( FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don't-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don't-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don't-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes.
Issue Date
2017
Status
published
Publisher
Oxford Univ Press
Journal
Bioinformatics 
ISSN
1460-2059; 1367-4803
Sponsor
International Max Planck Research School Molecular Biology, Gottingen

Reference

Citations


Social Media