
Nik Tavakolian: Shepherd - A Fast Data-Driven Clustering Scheme for Error Correction in DNA Sequencing

Time: Wed 2021-04-28, 14:00 - 14:45

Location: Zoom, meeting ID: 611 3329 7865

Participating: Nik Tavakolian


Abstract

DNA barcodes are short DNA sequences introduced into a cell population to track the relative frequencies of lineages over time. They have been widely used in biomedical applications, e.g., to track evolutionary lineages in yeast and to follow the progression of breast cancer in humans. In general, the barcodes are unknown upon insertion and must be identified using modern sequencing technologies, which are error-prone and can produce a large number of erroneous reads. In this study, we cast barcode error correction as a clustering problem whose aim is to identify the true barcodes from noisy sequencing data.

We present Shepherd, a novel data-driven clustering method based on a k-mer index of the barcode reads (k-mers, the length-k substrings of a sequence, are a well-known concept in bioinformatics) and a Bayesian approach to distinguishing true reads from erroneous ones. The k-mer indexing scheme significantly speeds up the comparison of barcode reads, allowing us to handle datasets with millions of reads. Moreover, our Bayesian decision scheme, which accounts for the error rates of barcode sequencing, markedly improves the accuracy of barcode identification over other state-of-the-art error correction schemes such as Bartender and Starcode. In this talk, I will introduce the intuitive concepts behind Shepherd and demonstrate its performance on both synthetic and real barcode sequencing data from lineage tracking of the baker's yeast Saccharomyces cerevisiae.
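
To illustrate why a k-mer index can speed up read comparison, here is a minimal Python sketch. It is not Shepherd's actual implementation: the function names (`kmers`, `build_index`, `candidates`, `hamming`), the toy reads, the choice k = 4, and the Hamming-distance threshold are all assumptions made for illustration, and the Bayesian decision step is not modelled. The idea shown is only that a read needs to be compared against reads sharing at least one k-mer with it, rather than against every other read.

```python
from collections import defaultdict


def kmers(read, k):
    """Return the set of all length-k substrings of a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def build_index(reads, k):
    """Map each k-mer to the set of indices of reads containing it."""
    index = defaultdict(set)
    for idx, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(idx)
    return index


def candidates(query, index, k):
    """Indices of reads sharing at least one k-mer with the query."""
    found = set()
    for kmer in kmers(query, k):
        found |= index.get(kmer, set())
    return found


def hamming(a, b):
    """Number of mismatching positions (assumes equal-length reads)."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy reads; real barcode datasets contain millions of reads.
reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "ACGAACGT"]
index = build_index(reads, k=4)

query = "ACGTACGT"
# Only candidates from the index are compared in full, not every read.
near = sorted(
    i for i in candidates(query, index, k=4)
    if hamming(reads[i], query) <= 1
)
```

In this toy example the read `TTTTGGGG` shares no 4-mer with the query, so it is never compared against it. A scheme like the one described in the abstract would then decide, using the sequencing error rates, which of the remaining nearby reads are true barcodes and which are errors.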