Nik Tavakolian: Shepherd - A Fast Data-Driven Clustering Scheme for Error Correction in DNA Sequencing

Time: Wed 2021-04-28 14.00 - 14.45

Location: Zoom, meeting ID: 611 3329 7865

Participating: Nik Tavakolian

Abstract

DNA barcodes are short DNA sequences introduced into a cell population to track the relative frequencies of lineages over time. They have been widely used in biomedical applications, for example to track evolutionary lineages in yeast and to follow the progression of breast cancer in humans. In general, the barcodes are unknown upon insertion and must be identified using modern sequencing technologies, which are error-prone and can produce a large number of erroneous reads. In this study, we cast barcode error correction as a clustering problem whose aim is to identify the true barcode reads in noisy sequencing data.

We present Shepherd, a novel data-driven clustering method based on an index of the barcode reads in terms of k-mers (a well-known concept in bioinformatics) and a Bayesian approach for distinguishing true reads from error reads. The k-mer indexing scheme significantly speeds up the comparison of barcode reads, allowing us to handle datasets with millions of reads. Moreover, our Bayesian decision scheme, which accounts for the error rates of barcode sequencing, markedly improves the accuracy of barcode identification over other state-of-the-art error correction schemes such as Bartender and Starcode. In this talk, I will introduce the intuitive concepts behind Shepherd and demonstrate its performance on both synthetic and real barcode sequencing data from lineage tracking of the baker's yeast Saccharomyces cerevisiae.
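
To make the k-mer indexing idea concrete, here is a minimal Python sketch of how such an index might narrow down which reads need to be compared. It is not Shepherd's actual implementation: the function names (`kmers`, `build_index`, `candidates`), the choice of k, and the `min_shared` threshold are all assumptions made for illustration, and the Bayesian decision step is left out entirely. A k-mer is simply a substring of length k; reads that share at least one k-mer with a query become candidates for a closer comparison, so most unrelated pairs are never compared at all.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(reads, k):
    """Map each k-mer to the set of read positions (ids) that contain it.

    Hypothetical helper: a plain inverted index, not Shepherd's data structure.
    """
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(read_id)
    return index


def candidates(query, index, k, min_shared=1):
    """Return ids of indexed reads sharing at least min_shared distinct k-mers
    with the query. Only these reads would need a full comparison."""
    shared = defaultdict(int)
    for kmer in set(kmers(query, k)):
        # .get avoids creating empty entries in the defaultdict index
        for read_id in index.get(kmer, ()):
            shared[read_id] += 1
    return {rid for rid, count in shared.items() if count >= min_shared}


# Toy data (invented for illustration): two similar reads and one unrelated read.
reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
index = build_index(reads, k=4)
similar = candidates("ACGTACGT", index, k=4)
```

In this toy example the query shares k-mers with the first two reads but none with the third, so the third read is never considered. Raising `min_shared` would make the filter stricter at the cost of possibly missing reads that contain several errors.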