Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches

Horwege, Sebastian; Lindner, Sebastian; Boden, Marcus; Hatje, Klas; Kollmar, Martin; Leimeister, Chris-Andre; Morgenstern, Burkhard

2014 | journal article. A publication with affiliation to the University of Göttingen.

Cite this publication

Spaced words and kmacs: fast alignment-free sequence comparison based on inexact word matches
Horwege, S.; Lindner, S.; Boden, M.; Hatje, K.; Kollmar, M.; Leimeister, C.-A. & Morgenstern, B. (2014)
Nucleic Acids Research, 42(W1) pp. W7-W11. DOI: https://doi.org/10.1093/nar/gku398


Documents & Media

Nucl. Acids Res.-2014-Horwege-W7-W11.pdf (575.87 kB, Adobe PDF)

License

Published Version

Special user license Goescholar License

Details

Authors: Horwege, Sebastian; Lindner, Sebastian; Boden, Marcus; Hatje, Klas; Kollmar, Martin; Leimeister, Chris-Andre; Morgenstern, Burkhard
Abstract: In this article, we present a user-friendly web interface for two alignment-free sequence-comparison methods that we recently developed. Most alignment-free methods rely on exact word matches to estimate pairwise similarities or distances between the input sequences. By contrast, our new algorithms are based on inexact word matches. The first of these approaches uses the relative frequencies of so-called spaced words in the input sequences, i.e. words containing 'don't care' or 'wildcard' symbols at certain pre-defined positions. Various distance measures can then be defined on sequences based on their different spaced-word composition. Our second approach defines the distance between two sequences by estimating for each position in the first sequence the length of the longest substring at this position that also occurs in the second sequence with up to k mismatches. Both approaches take a set of deoxyribonucleic acid (DNA) or protein sequences as input and return a matrix of pairwise distance values that can be used as a starting point for clustering algorithms or distance-based phylogeny reconstruction.
Issue Date: 2014
Status: published
Publisher: Oxford Univ Press
Journal: Nucleic Acids Research
ISSN: 1362-4962; 0305-1048
Sponsor: Open-Access-Publikationsfonds 2014
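Illustrative sketch: the two ideas described in the abstract can be made concrete with a few lines of Python. This is a minimal, naive illustration and not the authors' implementation. All function names are hypothetical. The pattern notation ('1' for a match position, '0' for a don't-care position) and the choice of Euclidean distance are assumptions made for the example. The second part uses a brute-force scan to find the longest k-mismatch match at each position; the published tool is described as estimating this length and is designed to be fast, so the sketch shows only the quantity being measured.

```python
from collections import Counter
from math import sqrt


def spaced_word_frequencies(seq, pattern):
    """Relative frequencies of spaced words in seq.

    pattern is a string of '1' (match position) and '0' (don't-care position);
    this notation is an assumption made for the example.
    """
    n_windows = len(seq) - len(pattern) + 1
    if n_windows <= 0:
        return {}
    counts = Counter()
    for i in range(n_windows):
        window = seq[i:i + len(pattern)]
        # Keep characters at match positions, mask don't-care positions with '*'.
        word = "".join(c if p == "1" else "*" for c, p in zip(window, pattern))
        counts[word] += 1
    return {word: n / n_windows for word, n in counts.items()}


def spaced_word_distance(seq_a, seq_b, pattern):
    """Euclidean distance between spaced-word frequency vectors (one possible measure)."""
    fa = spaced_word_frequencies(seq_a, pattern)
    fb = spaced_word_frequencies(seq_b, pattern)
    words = set(fa) | set(fb)
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))


def longest_k_mismatch_match(seq_a, i, seq_b, k):
    """Length of the longest substring of seq_a starting at i that occurs
    somewhere in seq_b with at most k mismatches (brute force, exact)."""
    best = 0
    for j in range(len(seq_b)):
        mismatches = 0
        length = 0
        while i + length < len(seq_a) and j + length < len(seq_b):
            if seq_a[i + length] != seq_b[j + length]:
                mismatches += 1
                if mismatches > k:
                    break
            length += 1
        best = max(best, length)
    return best


def average_k_mismatch_length(seq_a, seq_b, k):
    """Average of the per-position k-mismatch match lengths over seq_a."""
    if not seq_a:
        return 0.0
    total = sum(longest_k_mismatch_match(seq_a, i, seq_b, k)
                for i in range(len(seq_a)))
    return total / len(seq_a)


if __name__ == "__main__":
    print(spaced_word_frequencies("ACGT", "101"))
    print(spaced_word_distance("ACGT", "AGGT", "101"))
    print(longest_k_mismatch_match("ACGTACGT", 0, "ACGAACGT", 1))
```

Larger average match lengths indicate more similar sequences. Converting that average into a distance value, and filling a pairwise distance matrix for a whole set of sequences, is left out here because the record above does not give the formula.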
