Special Scientific Talk - Towards Perfect de novo DNA assembly

Date

Thursday, September 27, 2018 - 13:00 to 14:30

Location

C209, Center Building

Description

ABSTRACT

We are about to enter an era of DNA sequencing where one can in the near future produce, de novo, a reference-quality genome of any living species for 1,000 EU.  This ability will revolutionize ecology, evolution, and conservation science and effectively mark the beginning of a new exploration of the natural world.

The technological driver is the advent of long-read sequencers such as the PacBio Sequel and the Oxford Nanopore PromethION.  The long reads in effect make assembly easier, and one sees corresponding improvements in the continuity of the results, but the underlying algorithms are effectively the same as those first developed 20 years ago, and repeats at the scale of the read length are still an issue.  Indeed, truly better assembly requires finding all artifacts in the reads and resolving repeat families, topics that I don’t think have received sufficient attention and that are particularly critical issues for long reads.

Therefore, we are developing algorithms that carefully analyze a long-read shotgun data set before assembly.  By efficiently comparing all the data against itself, we have developed a computational approach to accurately determine the quality of any stretch of a PacBio read based only on the sequence data itself.  These intrinsic QVs allow us to accurately identify low-quality regions, chimers, and missed adaptamers.  Removing these artifacts with a process we call scrubbing leaves one with reads that assemble without the need for base-level error correction.  We have further developed a heuristic consensus algorithm that is far more efficient and accurate than previous methods and that also identifies potential sites of variation due to haplotypes or repeats.  Using this algorithm we further correct reads, typically to Q40 (99.99% accurate).  In effect, we have developed a process that takes Q7 reads full of artifacts and produces Q40 artifact-free reads, solving all aspects of the assembly problem save the separation of nearly identical, ubiquitous, and large repeats.
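
For readers unfamiliar with the Q notation used above, the following is a minimal sketch of the standard Phred-scale conversion, Q = -10 log10(p_error). It is not part of the speaker's software; the function names are hypothetical and chosen only for illustration.

```python
import math

def phred_to_accuracy(q: float) -> float:
    """Per-base accuracy implied by a Phred-scaled quality value q."""
    # Error probability is 10^(-q/10); accuracy is its complement.
    return 1.0 - 10.0 ** (-q / 10.0)

def error_to_phred(p_error: float) -> float:
    """Phred-scaled quality value for a per-base error probability."""
    return -10.0 * math.log10(p_error)

if __name__ == "__main__":
    # The two quality levels mentioned in the abstract.
    for q in (7, 40):
        print(f"Q{q}: {phred_to_accuracy(q):.2%} accurate")
```

Under this convention Q40 corresponds to one error in 10,000 bases (99.99% accurate), while Q7 corresponds to roughly one error in five bases (about 80% accurate).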

 

SHORT BIOGRAPHY

In 2012 Gene Myers joined a growing group of computational biologists in Dresden as the founding director of a new Systems Biology Center that is being built as part of an extension of the Max Planck Institute of Molecular Cell Biology and Genetics (MPI-CBG). His group focuses on engineering specialized light microscopes for cell biology and analyzing the imagery produced by such microscopes. Previously Gene had been a group leader at the HHMI Janelia Farm Research Campus (JFRC) since its inception in 2005. Gene came to the JFRC from UC Berkeley, where he was on the faculty of Computer Science from 2003 to 2005. From 1998 to 2002 he was the Vice President of Informatics Research at Celera Genomics, where he and his team determined the sequences of the Drosophila, Human, and Mouse genomes using the whole genome shotgun technique that he advocated in 1996. Prior to that Gene was on the faculty of the University of Arizona for 17 years, and he received his Ph.D. in Computer Science from the University of Colorado in 1981.

His research interests include the design and analysis of algorithms for problems in computational molecular biology, image analysis of bioimages, and light microscopy with a focus on building models of the cell and cellular systems from imaging data. He is best known for the development of BLAST, the most widely used tool in bioinformatics, and for the paired-end whole genome shotgun sequencing protocol and the assembler he developed at Celera that delivered the fly, human, and mouse genomes in a three-year period. He has also written many seminal papers on the theory of sequence comparison.

He was awarded the IEEE 3rd Millennium Achievement Award in 2000, the Newcomb Cleveland Best Paper in Science award in 2001, and the ACM Kanellakis Prize in 2002. He was voted the most influential person in bioinformatics in 2001 by Genome Technology Magazine and was elected to the National Academy of Engineering in 2003. In 2004 he won the International Max Planck Research Prize, and in 2005 he was selected as one of two distinguished alumni (with David Haussler) at his alma mater, the University of Colorado. In 2006 Gene was inducted into the Leopoldina, the German Academy of Sciences, and awarded an honorary doctorate at ETH Zurich.
