FY2011 Annual Report

Physics and Biology Unit

Professor Jonathan Miller

Abstract

The key role that duplication plays in evolution has been recognized for nearly a century, but only in the last ten years has it been understood that sequence duplications occur in both bacteria and eukaryotes at rates per generation 10,000 times those of single-base substitution. Thus, nature – “the blind plagiarizer,” in H.C. Lee’s phrase – continually plunders her own manuscripts (genome sequences) to generate novel variation for selection to act upon. Most of the time, a sequence duplication occurring in an individual within a population confers no fitness benefit and is lost within a few generations.

Yet plagiarism may be a tricky business to execute effectively and efficiently. If what the interminable rat race calls for just now were merely a short oligonucleotide, such as a new copy of a promoter, it would be reckless of nature to plagiarize a million bases of sequence at a time. On the other hand, sometimes a new copy of an operon or gene cluster is what will give the organism an edge, so the copying apparatus needs to regularly duplicate sequences at these very large scales as well. How is the proper balance of scales achieved?

One might imagine that over the last few billion years, evolution has figured out how to plagiarize in such a way as to hedge all bets, and we could look to genome sequence for clues to its strategy. We reported this year the first exhaustive census of the lengths of exact duplications in whole genomes. It turns out that the distributions of duplicated sequence lengths are scale-free.

1. Staff

Dr. Kun Gao, Researcher
Dr. Sathish Venkatesan, Researcher
Dr. Maxim Koroteev, Researcher
Dr. Eddy Taillefer, Researcher
Dr. Federico Manna, Researcher
Quoc-Viet Ha, Technical Staff
Midori Tanahara, Administrative Assistant

2. Collaborations

Nothing to report.

3. Activities and Findings

3.1 Repetitive structure of genomes.

The repetitive structure of genomic DNA is a primary object of study in genomics: duplication is often associated with sequence rearrangement and human disease. Repetitive genome structure can be classified into tandem repeats, interspersed repeats, transposons, and segmental duplication – the latter defined, for example, as two or more segments of DNA longer than 1000 base-pairs sharing at least 90% identity. Because we are interested here in neutral evolution, we study exact duplicates of all lengths, irrespective of classification or function.

Our study of sequence duplication is further motivated by the observation of an unexplained power-law regime in the length distribution of sequences conserved between divergent genomes, that is, sequences of which at least one copy can be found in each of two or more genomes. Those inter-genomic computations were performed by hash-table based search and by whole-genome alignment with the alignment tool BLASTZ (recently supplanted by LASTZ). Whole-genome alignment has recently demonstrated an intra-genome power-law regime in the length distribution of exactly duplicated sequence within a single genome. However, practical whole-genome alignment algorithms are heuristic and time-consuming, while hash-table based methods are unwieldy and can make it difficult to recover sequence features such as local maximality (see below) and copy number.
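As a minimal sketch of what a hash-table based search for shared sequence can look like, the Python below indexes every fixed-length word (k-mer) of one sequence and looks up the k-mers of a second sequence. The function names, the word length k, and the toy sequences are illustrative assumptions, not the procedure used in the work reported here.

```python
from collections import defaultdict

def kmer_index(seq, k):
    """Hash table mapping each k-mer of seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def shared_kmers(seq_a, seq_b, k):
    """Exact k-mer matches between two sequences, as (position in a, position in b) pairs."""
    index_a = kmer_index(seq_a, k)
    hits = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        kmer = seq_b[j:j + k]
        if kmer in index_a:  # membership test does not create new keys
            hits[kmer].extend((i, j) for i in index_a[kmer])
    return dict(hits)

# Toy sequences (hypothetical): they share the 5-base word TACGG.
hits = shared_kmers("ACGTACGGA", "TTACGGTT", k=4)
for kmer, pairs in hits.items():
    print(kmer, pairs)
```

With k = 4 the single shared word TACGG is reported as two overlapping hits, TACG and ACGG, on the same diagonal. Recovering the full match requires a further extension step, which is one way in which fixed-length hashing obscures local maximality.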

3.2 New sequence elements for counting repeats: super, local, and non-nested local maxmers.

To account in a model-independent way for copies of subsequences, we generalize the standard notion of maximal subsequence.

A repeat is a set of identical duplications (occurrences) at distinct positions in a chromosome.
A super maximal repeat (super maxmer) is a repeat that is not contained in any longer duplication.
A local maximal repeat (local maxmer) has no extension that is itself repeated.
A nested occurrence is an occurrence of a local maxmer that is contained within some longer maxmer.
A non-nested local maxmer is a local maxmer having at least one occurrence that is not nested.

In the example sequence:

NLAREPLNOREPTFCGIREPTLSIG

the sub-sequence REPT is a super maxmer; the sub-sequence REP is a local maxmer; and because the first occurrence of REP is not nested inside REPT, REP is a non-nested local maxmer. The super maxmers together with the non-nested local maxmers constitute a minimal decomposition of a chromosome into maxmers, in the sense of approximating the smallest grammar. They enable us to decompose a repetitive genome uniquely.
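The example can be checked with a brute-force sketch in Python. It rests on two assumptions that are ours: a local maxmer is read as a repeat whose occurrences cannot all be extended by the same character on the left or on the right, and repeats shorter than a hypothetical min_len of 2 are ignored so that single repeated letters do not clutter the output. The function names are illustrative, and the enumeration of all substrings is suitable only for toy inputs, not for whole genomes.

```python
from collections import defaultdict

def repeats(seq, min_len=2):
    """Every substring of length >= min_len that occurs at least twice, with its start positions."""
    positions = defaultdict(list)
    n = len(seq)
    for length in range(min_len, n):
        for start in range(n - length + 1):
            positions[seq[start:start + length]].append(start)
    return {sub: pos for sub, pos in positions.items() if len(pos) >= 2}

def local_maxmers(seq, min_len=2):
    """Repeats whose occurrences differ in both their left and right neighbouring characters."""
    result = {}
    for sub, starts in repeats(seq, min_len).items():
        left = {seq[s - 1] if s > 0 else None for s in starts}
        right = {seq[s + len(sub)] if s + len(sub) < len(seq) else None for s in starts}
        if len(left) > 1 and len(right) > 1:
            result[sub] = starts
    return result

def super_maxmers(maxmers):
    """Maxmers that are not contained in any longer maxmer."""
    return {m: pos for m, pos in maxmers.items()
            if not any(m != other and m in other for other in maxmers)}

def non_nested_occurrences(maxmers):
    """For each maxmer, the occurrences not covered by an occurrence of a longer maxmer."""
    intervals = [(s, s + len(m)) for m, starts in maxmers.items() for s in starts]
    result = {}
    for m, starts in maxmers.items():
        free = [s for s in starts
                if not any(a <= s and s + len(m) <= b and b - a > len(m)
                           for a, b in intervals)]
        if free:
            result[m] = free
    return result

seq = "NLAREPLNOREPTFCGIREPTLSIG"
maxmers = local_maxmers(seq)
print(maxmers)                          # REP and REPT with 0-based start positions
print(super_maxmers(maxmers))           # REPT only
print(non_nested_occurrences(maxmers))  # REP keeps only its first occurrence
```

Under these assumptions the sketch reproduces the classification in the text: REPT is the only super maxmer, and REP is a non-nested local maxmer by virtue of its first occurrence.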

3.3 Characterization of duplications in natural genome sequences.

We computed maxmers and their distributions for a comprehensive set of publicly available genomes. Illustrated in figure 1 are maxmer distributions for successive assemblies of the sea urchin genome. As assemblies mature, the maxmer distributions increasingly conform to a straight line on a log-log plot, with a slope in the neighborhood of -3. This scale-free form seems to be a generic feature of natural genomes that we can account for within models of neutral sequence evolution.

Figure 1. Maxmer distributions for successive assemblies of the sea urchin genome. Plots for successive assembly versions are displaced for clarity.
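A straight line on a log-log plot corresponds to a power law, count ∝ length^slope. The Python sketch below shows one simple way to read off such a slope, by ordinary least squares on the logarithms of a length histogram. The report does not state how the slope was estimated, so the estimator, the function names, and the synthetic histogram are assumptions made for illustration. Least squares on binned counts is a crude estimator and is used here only because it mirrors the visual straight-line criterion.

```python
import math
from collections import Counter

def length_histogram(lengths):
    """Number of repeats observed at each length."""
    return Counter(lengths)

def loglog_slope(histogram):
    """Least-squares slope and intercept of log10(count) against log10(length)."""
    points = [(math.log10(length), math.log10(count))
              for length, count in histogram.items()
              if length > 0 and count > 0]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two non-empty bins")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic histogram (not real genome data): counts fall off as length^-3.
synthetic = {length: 1e9 * length ** -3 for length in range(10, 101)}
slope, intercept = loglog_slope(synthetic)
print(round(slope, 3), round(intercept, 3))
```

On real maxmer lengths, `length_histogram` would supply the input, and the fit would normally be restricted to the range of lengths over which the distribution is visibly linear.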

4. Publications

4.1 Journals

Gao K, Miller J, Algebraic Distribution of Segmental Duplication Lengths in Whole-Genome Sequence Self-Alignments. PLoS ONE. 6(7): e18464. doi:10.1371/journal.pone.0018464. (Peer-reviewed)
M. Koroteev and J. Miller. Scale-free duplication dynamics: A model for ultraduplication. Physical Review E. August 2011 (Peer-reviewed)

4.2 Books and other one-time publications

Nothing to report

4.3 Oral and Poster Presentations

M Koroteev and J Miller. A model for ultraduplication. Cold Spring Harbor Laboratory course Computational & Comparative Genomics. CSHL Bush Building. November 9 – 15 2011. (poster).
Eddy Taillefer and Jonathan Miller. Exhaustive computation of exact sequence duplications in whole genomes via super and local maximal repeats. In International Conference on Computer Engineering and Bioinformatics, Vol. 21, pp. 22–29, Cairo, Egypt, Oct. 2011. Excellent paper award. (peer-reviewed proceedings)
Eddy Taillefer and Jonathan Miller. Ultraduplication in genome sequences and natural language texts. In Asian Young Researchers Conference on Computational and Omics Biology, pp. P–09, Daejeon, Korea, Aug. 2011. (poster)
K Gao and J Miller. Algebraic Distribution of Segmental Duplication Lengths in Whole-Genome Sequence Self-Alignments. 5th Asian Young Researchers Conference on Computational and Omics Biology (AYRCOB) Daejeon, South Korea. Jul. 2011.
Eddy Taillefer and Jonathan Miller. Algebraic length distribution of sequence duplications in whole genomes. In International Conference on Natural Computation, Vol. 3, pp. 1454–1460, Shanghai, China, Jul. 2011. (peer-reviewed proceedings)
J Miller. A genome-wide view of synonymous codon conservation. QECG2011. Seaside House, OIST. June 3 2011.

5. Intellectual Property Rights and Other Specific Achievements

Nothing to report

6. Meetings and Events

6.1 OIST Summer School and Workshop QECG2011: Linkage and Recombination and Genome Sequences

Date: May 16 - June 3, 2011
Venue: OIST Seaside House
Organizers: Jonathan Miller (Physics and Biology Unit, OIST), Alexander Mikheyev (Ecology and Evolution Unit, OIST) and Gilean McVean (Department of Statistics, University of Oxford)

Speakers:
- Dan Andersson, Uppsala University
- Joachim Hermisson, University of Vienna
- Hideki Innan, Graduate University for Advanced Studies
- Ichizo Kobayashi, University of Tokyo
- Takehiko Kobayashi, National Institute of Genetics, Japan
- Thomas Lenormand, Centre d'Ecologie Fonctionnelle et Evolutive, Montpellier
- Michael Lynch, Indiana University
- Gilean McVean, Oxford
- Alexander Mikheyev, OIST
- Jonathan Miller, OIST
- Simon Myers, Oxford
- Molly Przeworski, University of Chicago
- David Romero, Centro de Ciencias Genómicas, UNAM
- Susan Rosenberg, Baylor College of Medicine
- Mikkel Schierup, Aarhus University
- Guy Sella, Hebrew University of Jerusalem
- Yun Song, UC Berkeley
- Joel Stavans, Weizmann Institute of Science
- Clifford Zeyl, Wake Forest University

6.2 Seminar

Date: July 25, 2011
Venue: OIST Campus C015
Speaker: Michael R. King (Cornell University)
Title: Multiscale Models and Mechanisms of Receptor Adhesion in the Circulation

6.3 Mini Workshop

Date: October 17, 2011
Venue: OIST Campus C210
Speaker: Pauline Fujita (University of California, Santa Cruz)
Title: UCSC genome browser workshop

6.4 Seminar

Date: November 09, 2011
Venue: OIST Campus D014
Speaker: Dr. Federico MANNA (Centre d’Ecologie Fonctionnelle et Evolutive, CNRS-Montpellier)
Title: "Fitness Landscapes: An Alternative Theory for the Dominance of Mutation"

6.5 Seminar

Date: November 13, 2011
Venue: OIST Campus B250
Speaker: Dr. Timothy J P Hubbard (Wellcome Trust Sanger Institute, Cambridge)
Title: "From Genome Annotation to Genomic Medicine"

7. Other

Nothing to report.