Physics and Biology Unit (Jonathan Miller)

Sequence Conservation Within and Among Genomes:
Quantitative Evolutionary, Comparative, and Biomedical Genomics

The Unit's research aims to develop theoretical and computational tools for decoding genome sequences. We are especially interested in the impact of linkage, recombination, and epistasis on sequence conservation, and in how they affect inference of selective pressure.


As we first reported in 2005 [1,2,3], the lengths of strongly conserved insect and vertebrate sequences exhibit a distinctive and anomalous 'heavy tailed' profile that is inconsistent with standard local evolutionary dynamics dominated by point mutations or short indels. To a physical scientist, such a heavy tail is the hallmark of strong non-local spatial correlation of conserved bases - in biologists' jargon, "linkage" [4]. Although we at first believed the linkage reflected functional interaction [5,6,7], our more recent report that heavily repetitive sequence shares this feature [8] suggests that it derives from a neutral mechanism. The neutral mechanism associated with linkage is recombination, whose role in sequence evolution and comparative genomics has largely been treated as intractable or of only marginal relevance.
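To make the contrast concrete, here is a sketch under a deliberately simplified assumption that is ours for illustration and is not taken from the cited reports: if each aligned base were conserved independently with probability q, the length of a maximal conserved run would be geometrically distributed, so its tail would decay exponentially. A heavy (algebraic) tail instead decays as a power of the length. The symbols q and alpha are illustrative.

```latex
% Independent sites (illustrative assumption): geometric run lengths
P_{\mathrm{local}}(L = \ell) = (1-q)\,q^{\ell-1}, \qquad
P_{\mathrm{local}}(L \ge \ell) = q^{\ell-1}, \qquad \ell \ge 1

% Heavy tail: algebraic decay with some exponent \alpha > 0
P_{\mathrm{heavy}}(L = \ell) \propto \ell^{-\alpha}
```

On a log-log plot the second form is a straight line, while the first falls off ever more steeply, which is why long conserved runs are far more frequent under a heavy tail than independent local mutation would allow.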

Here, we report on the first studies of the length distributions of exact and nearly exact duplicate sequences within a single genome (E. Taillefer and J. Miller, 2010 [9]). We identify an analogous heavy tail in this distribution - shared among metazoans, plants, and bacteria - a phenomenon we call "ultra-duplication." The elements comprising this tail include, for example, homeobox sequences. We show that the form of this tail is remarkably constrained, and we exhibit a minimal model that reproduces the observed constraint (M. Koroteev and J. Miller, 2010; submitted). The essential property of the model is the occurrence of duplication events at all scales.
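The quantity discussed above, the length distribution of exact duplicates within one sequence, can be illustrated on a toy string. The following is a minimal sketch, not the pipeline used in the cited work: it is a naive quadratic scan suitable only for short sequences, and the function name `duplicate_length_histogram` and the parameter `min_len` are hypothetical names introduced here.

```python
from collections import Counter


def duplicate_length_histogram(seq: str, min_len: int = 2) -> Counter:
    """Count maximal exact self-matches of `seq`, grouped by length.

    A self-match is a pair of start positions (i, j) with i < j such that
    seq[i:i+L] == seq[j:j+L] and the match cannot be extended on either side.
    Naive O(n^2) scan along each diagonal j - i = offset (toy sizes only).
    """
    n = len(seq)
    hist = Counter()
    for offset in range(1, n):
        run = 0  # length of the current run of matching characters
        for i in range(n - offset):
            if seq[i] == seq[i + offset]:
                run += 1
            else:
                if run >= min_len:
                    hist[run] += 1
                run = 0
        if run >= min_len:  # a run that reaches the end of the diagonal
            hist[run] += 1
    return hist


if __name__ == "__main__":
    # "ACGT" occurs twice (positions 0 and 5), giving one match of length 4.
    toy = "ACGTAACGTC"
    for length, count in sorted(duplicate_length_histogram(toy).items()):
        print(length, count)
```

On a real genome one would plot count against length on log-log axes to look for the heavy tail; at that scale an index such as a suffix array would replace the quadratic scan.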


  1. B. Kammandel et al. (1999). Developmental Biology 205, 79–97.
  2. G. Bejerano et al. (2004). Science 304(5675), 1321-5.
  3. J. Miller and P. Havlak (2005). CSHL Genome Informatics Meeting.  
  4. The author is deeply grateful to Sydney Brenner for asserting the application of the term “linkage” to these observations.
  5. W. Salerno et al. (2006). Proc Natl Acad Sci USA, 103(35), 13121-5.
  6. J. Miller (2007). Genomic Imprint of the Interactome - University of Genome Evolution.
  7. J. Miller (2007). First Annual q-bio Conference on Cellular Information Processing, Santa Fe, NM.
  8. J. Miller (2009). IEICE Technical Report, Neurocomputing, 109(53) (2009-05-18).
  9. K. Gao and J. Miller (2011). Algebraic Distribution of Segmental Duplication Lengths in Whole-Genome Sequence Self-Alignments. PLoS ONE 6(7), e18464. doi:10.1371/journal.pone.001846
  10. M.V. Koroteev and J. Miller. Scale-free duplication dynamics: a model for ultraduplication. Phys. Rev. E.
  11. E. Taillefer and J. Miller (2011). Algebraic length-distribution of sequence duplications in whole genomes. In Proc. of International Conf. on Natural Computation, vol. 3, pp. 1454-1460, Shanghai, China, Jul. 2011.
  12. E. Taillefer and J. Miller (2011). Exhaustive computation of exact sequence duplications in whole genomes via super and local maximal repeats. In International Conference on Computer Engineering and Bioinformatics, vol. 21, pp. 22-29, Cairo, Egypt, Oct. 2011. Excellent paper award.