Ivan Tolstoganov: Multi-context seeds enable fast and high-accuracy sequence alignment
Time: Wed 2025-02-19 13.00 - 14.00
Location: Room Cramer
Participating: Ivan Tolstoganov
Abstract
Sequence alignment is the process of finding the best-matching substring within a long string by optimizing a string similarity measure (e.g. Hamming distance). It is essential to the analysis of DNA, RNA, or protein sequences across multiple organisms. A key step in aligning two sequences is to identify short subsequences, or seeds, that are found in both the query and the reference sequence. A well-known trade-off is that longer seeds offer fast searches but lower sensitivity in variable regions, whereas shorter seeds are more sensitive but slower to search. We introduce multi-context seeds (MCS), which effectively allow us to store seeds of different lengths in the same index structure, thus retaining the advantages of both shorter and longer seeds. We demonstrate the practical applicability of MCS by implementing them in the existing genome alignment tool strobealign. We show that strobealign with MCS improves alignment accuracy for short sequences at no additional cost in memory and little cost in runtime.
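The seed-length trade-off described in the abstract can be illustrated with a minimal Python sketch. It uses plain fixed-length k-mer seeds in a dictionary index, which is an assumption made for illustration only: it is not the MCS scheme and not strobealign's index. The function names and toy sequences are hypothetical.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def find_seed_hits(query, index, k):
    """Return (query_pos, reference_pos) pairs for k-mers shared by both."""
    hits = []
    for q in range(len(query) - k + 1):
        for r in index.get(query[q:q + k], ()):
            hits.append((q, r))
    return hits


# Toy data (hypothetical): the query is reference[8:20] with one substitution.
reference = "ACGTACGTTAGCCGATAGGCTTAACG"
query = "TAGCCGTTAGGC"

# Short seeds: some seeds avoid the substitution, so the true location is found,
# but unrelated locations can match as well.
short_hits = find_seed_hits(query, build_index(reference, 4), 4)

# Long seeds: every 8-mer of this 12-base query overlaps the substitution.
long_hits = find_seed_hits(query, build_index(reference, 8), 8)

print("k=4 hits:", short_hits)
print("k=8 hits:", long_hits)
```

In this toy case the short seeds still locate the query at its true offset in the reference but also produce hits at an unrelated offset, while the long seeds find nothing because the single substitution breaks every one of them. Storing seeds of several lengths in one index, as the abstract describes for MCS, is aimed at avoiding this either-or choice.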