From YouTube: Chirag Jain: Sketch-based algorithms for large-scale whole-genome comparisons

Description

Sketch-based algorithms for large-scale whole-genome comparisons
Chirag Jain, Indian Institute of Science

December 14, 2022
9:00-9:30 Pacific Time

In the era of exponential data growth, sketching has become a standard algorithmic technique for rapid genome sequence comparison. I will describe our work on a fast, lightweight approximate sequence mapping algorithm that uses minimizer sampling and MinHash techniques. The proposed algorithm computes the positional origin of a query sequence in a given reference and estimates nucleotide-level identity under an assumed probabilistic model of mutations. We show an application of this algorithm to quantifying the relatedness of two microbial genomes, and its impact on tackling a long-standing biological question. Microbiologists are increasingly turning to approaches driven by whole-genome sequencing to address fundamental questions in ecology. This involves quantifying the similarity of two or more genomes, e.g., to check whether a newly sequenced genome is novel, or where else it has been seen before. We developed the FastANI (Average Nucleotide Identity) software using the proposed approximate sequence matching framework to quantify the similarity of two or more genomes. Our algorithmic improvements, coupled with parallelizability, allowed us to index an entire database of 90,000 bacterial genomes and compute ANI values among all pairs of genomes for the first time. This analysis sheds light on the extent to which microbes form discrete clusters (species).
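
The two sketching ideas named in the abstract, minimizer sampling and MinHash, can be illustrated with a small Python sketch. This is not the FastANI implementation: all function names and parameters (k, w, s) are hypothetical and chosen for illustration. The identity formula assumes a Poisson model of mutations, as used by the Mash tool; the abstract does not say which probabilistic model the talk uses, so treat that choice as an assumption.

```python
import hashlib
import math


def hash_kmer(kmer):
    """Stable 64-bit hash of a k-mer (illustrative choice of hash function)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def kmer_hashes(seq, k):
    """Hash every k-mer of seq, in order of position."""
    return [hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    """Minimizer sampling: in every window of w consecutive k-mers,
    keep the k-mer with the smallest hash (leftmost on ties).
    Returns a set of (position, hash) pairs."""
    hashes = kmer_hashes(seq, k)
    picked = set()
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        picked.add((start + window.index(smallest), smallest))
    return picked


def minhash_sketch(seq, k, s):
    """Bottom-s MinHash sketch: the s smallest distinct k-mer hashes."""
    return sorted(set(kmer_hashes(seq, k)))[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches:
    take the s smallest hashes of the union and count how many are shared."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def identity_estimate(jaccard, k):
    """Convert a Jaccard estimate to nucleotide identity, ASSUMING a Poisson
    mutation model: distance = -(1/k) * ln(2j / (1 + j)), identity = 1 - distance."""
    if jaccard <= 0.0:
        return 0.0
    return max(0.0, 1.0 + math.log(2.0 * jaccard / (1.0 + jaccard)) / k)
```

A usage example with made-up sequences: sketch a reference and a query that differs by a single substitution, then estimate their similarity.

```python
reference = "ACGTACGGTCAGTTGACCATGACGTTAGCAT"
query = "ACGTACGGTCAGTTGACCATGACGTTCGCAT"  # one substitution near the end
k, s = 5, 10
j = jaccard_estimate(minhash_sketch(reference, k, s), minhash_sketch(query, k, s), s)
print(j, identity_estimate(j, k))
```

On sequences this short the estimate is noisy; the technique is meant for genome-scale inputs, where k and s are far larger.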