# FY2015 Annual Report

**Mathematical Biology Unit
Associate Professor Robert Sinclair**

## Abstract

The highlight of this year was most certainly the publication in PNAS of our joint work on teleost whole genome duplication. This was a very demanding project from many points of view, and it was very pleasing to see how the challenges were overcome through intense interdisciplinary work involving researchers from multiple universities across Japan. What was particularly interesting, from the point of view of mathematical biology, was the intimate connection established between genomic and mathematical aspects of the work.

## 1. Staff

- Dr. Jun Inoue, Staff Scientist
- Shino Fibbs, Administrative Assistant

## 2. Collaborations

### 2.1 Genome Evolution

- Type of collaboration: JSPS Grant 24770070
- Researchers:
  - Professor Mutsumi Nishida, University of the Ryukyus
  - Professor Katsumi Tsukamoto, Nihon University
  - Assistant Professor Yukuto Sato, Tohoku University

### 2.2 Theoretical Virology

- Description: Algorithm development for highly sensitive sequence similarity detection, beyond what is usually called the twilight zone of sequence similarity. We are using viral capsid sequences as a challenging test case. Intended applications are in the area of genome annotation.
- Type of collaboration: Joint research
- Researchers:
  - Professor Dennis Bamford, University of Helsinki
  - Dr. Janne Ravantti, University of Helsinki

### 2.3 Integrable systems

- Description: Crashing of organismal pattern generators or the collapse of ecosystems may in some cases be describable in terms of loss of integrability. This is a long-term project which involves, and will perhaps contribute to, cutting-edge mathematics.
- Type of collaboration: Joint research
- Researchers:
  - Professor Martin Guest, Waseda University
  - Associate Professor Takashi Sakai, Tokyo Metropolitan University

## 3. Activities and Findings

### 3.1 Mathematical Biology (R. Sinclair)

I understand Mathematical Biology to be the study of structures or patterns in biology, with the long-term aim of contributing both to the understanding of biology and also to the development of mathematics. I do not consider Mathematical Biology to be purely a branch of applied mathematics, since that does imply a hierarchy or directed flow of knowledge, rather than an engagement between equal partners from which both stand to benefit. For the same reason, I do not consider Mathematical Biology to be purely an exercise in modeling, although modeling is often a part of my work.

The primary challenge, as I see it, is one of *communication*. A mathematician cannot contribute to an understanding of biology without making a sincere attempt to understand biological questions in their biological context. It is for this reason that I devote a significant fraction of my time to reading the primary literature and discussing biology with biologists. For any of the activities described below to be successful, communication is central. Mathematical approaches need to be described in a manner which is useful for biologists, and biological questions need to be formulated in a manner which allows for mathematical study.

Major themes of my work in Mathematical Biology have been and will remain **(i)** the relationship between biological homology (in the sense of Richard Owen), similarity or dissimilarity, and geometry (discrete, Euclidean, Riemannian, Finsler or otherwise), and **(ii)** an attempt to understand genome sequences, specifically including questions of evolutionary history, information content (hopefully leading to a better definition of “information” in this context) and possible constraints on these. In the following, I will describe my approach to Mathematical Biology, future plans for work in these areas, and conclude with descriptions of some current research projects.

**3.1.1 Homology/similarity/geometry**

The relationship between homology, similarity and geometry was brought to my attention largely as a result of being invited to work on the statistical analysis of dinosaur bone morphologies, where the shapes of the fossilised bones provided a very solid instance of geometry. In future work, I would like to revisit this area with the specific goal of defining shape measurement protocols which are as insensitive as possible to distortions resulting from taphonomic and geologic processes. This may be an application of classical invariant theory.

Geometry can however also be relevant to the study of protein or genome sequences, where the relevant theory is likely to come from discrete geometry, possibly of a nonstandard type (e.g. the triangle inequality is known not to be satisfied for some commonly used substitution matrices, meaning that a detour can appear to be shorter than a direct path).
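
To make the remark about detours concrete, the following minimal sketch lists every triple of symbols for which a direct dissimilarity exceeds the sum of the two legs of a detour. The function name `triangle_violations` and the three-symbol table `toy` are hypothetical and chosen only for illustration; the numbers are not taken from any published substitution matrix.

```python
from itertools import permutations

def triangle_violations(d):
    """Return all (x, y, z) for which the detour x -> y -> z is shorter
    than the direct path x -> z, i.e. d[x][z] > d[x][y] + d[y][z]."""
    symbols = list(d)
    return [
        (x, y, z)
        for x, y, z in permutations(symbols, 3)
        if d[x][z] > d[x][y] + d[y][z]
    ]

# Hypothetical dissimilarities between three symbols (illustrative values
# only, NOT derived from any real substitution matrix).
toy = {
    "A": {"A": 0, "B": 1, "C": 5},
    "B": {"A": 1, "B": 0, "C": 2},
    "C": {"A": 5, "B": 2, "C": 0},
}

violations = triangle_violations(toy)
print(violations)  # [('A', 'B', 'C'), ('C', 'B', 'A')]
```

Here the direct dissimilarity between A and C is 5, whereas the detour through B costs only 1 + 2 = 3, so the table is not a metric.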

The study of biological homology absolutely demands that both “historical” and “functional” aspects of the object of study be taken into consideration. The alignment of protein sequences, commonly performed by software tools such as BLAST, is theoretically challenging because it is used in both the study of evolutionary relatedness and also function (active site identification and the like): The problem is that one tool is used in more than one context. The first question to be asked is whether a single tool can in fact be applied in both contexts, or whether two tools are required. The most striking class of examples I can think of in this connection are the overprinted protein coding genes found in some viral genomes. The protein sequences (of original and overprinted genes) are often outwardly dissimilar, even those portions which derive from precisely the same genomic region. Any common protein sequence alignment tool would provide some indication of a degree of dissimilarity, but on the basis of an alignment most likely failing to reflect the actual genomic relationships between the codons in question. In other words, it is not clear what the meaning of an alignment could be in such a case, nor the measure of dissimilarity derived from it. Similar theoretical problems arise in connection with multiple alignments of sequences which are not known to be related by descent or function: What can one infer from such an alignment, if anything at all?

I have approached these questions from the point of view that engagement with a specific family of proteins, which have been studied experimentally from many points of view, is the best way to inspire theory. In collaboration with Prof. Dennis Bamford, a virologist at the University of Helsinki, and a senior scientist in his lab, I have been working on detecting similarity between highly divergent sequences of viral capsid proteins, beginning with those which are associated with icosahedral capsids. It is already known that these capsid proteins form natural groupings, or lineages, on the basis of structural analysis. Protein sequence similarity can sometimes be detected using standard bioinformatics tools, but more usually not. I approached the problem of searching for protein sequence similarity using alignments as the basis, as is very often done. Rather than asking only “What is their optimal alignment?”, however, I first ask “Can they be aligned?” It has been possible to develop algorithms, based upon dynamic programming and z-score statistics, which give some indication of whether a given set of sequences may be alignable (in the sense that each makes a positive, and statistically significant, contribution to the total alignment score). If so, dynamic programming offers a sensitive, if slow, method of producing alignments. Using this combination, we have been able to show that a structure-based classification of viral capsid proteins can be reproduced on the basis of the highly divergent protein sequences alone. We believe that the new method will also be applicable to the difficult but medically and agriculturally important problem of orphan gene annotation (homology detection).
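
The general idea of combining dynamic programming with z-score statistics can be sketched for the simplest case of two sequences. This is a deliberately simplified, pairwise illustration and not the algorithm developed in the project, which deals with sets of sequences. The function names, the match/mismatch/gap scores and the shuffling null model are all assumptions made for the example.

```python
import random
import statistics

def nw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty
    (the scoring values are illustrative assumptions)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def shuffle_z_score(a, b, n_shuffles=200, seed=0):
    """Z-score of the observed score against scores obtained after
    shuffling b, which preserves composition but destroys order."""
    rng = random.Random(seed)
    observed = nw_score(a, b)
    letters = list(b)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null.append(nw_score(a, "".join(letters)))
    sd = statistics.pstdev(null)
    if sd == 0:
        # Degenerate null distribution (e.g. b consists of one repeated
        # letter): report no evidence either way.
        return 0.0
    return (observed - statistics.fmean(null)) / sd

seq = "MKVLAAGIVGLLLAQ" * 2  # arbitrary made-up sequence
z = shuffle_z_score(seq, seq)
```

A large positive z-score indicates that the observed alignment score is unlikely to arise from composition alone; in the spirit of the question “Can they be aligned?”, a score indistinguishable from the shuffled background would suggest that an alignment should not be interpreted.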

Intermediate sequences are commonly sought in connection with ancestral sequence reconstruction or pragmatic attempts to improve alignments between divergent sequences. The word “intermediate” does imply “middle” in some sense, and this raises the question of whether there is any relation to mathematical expressions such as (a+b)/2. The issue is made more complex by the fact that the substitution matrices usually used to evaluate whether a sequence is “between” two other given ones do not necessarily correspond to distances in the usual mathematical (metric) sense. Given the usefulness (beyond ancestral sequence reconstruction) of a software tool which could construct midpoint sequences independently of sampling biases typical of database content, I am already working with collaborators in Helsinki (in the lab of Prof. Bamford) on a prototype bioinformatics tool. This has required a rethink of what “midpoint” could mean in a discrete context which is not necessarily geometrical in any standard sense, and considerations of how such a definition might be translated into a working tool. One promising approach is to cast the problem of finding the midpoint in terms of a minimax problem (m=(a+b)/2 minimizes max{d(a,m),d(m,b)} in Euclidean geometry of any dimension), and to solve this problem using a modified dynamic programming algorithm in which the amino acids of the midpoint sequence are determined as the algorithm progresses. The final aim of this work is to contribute to bioinformatics, a natural goal given the fact that mathematical approaches to practical problems do often translate well into practical algorithms.
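
The minimax characterisation of the Euclidean midpoint quoted above can be verified in a few lines. The derivation below is a sketch; it uses the triangle inequality in its second step, which is precisely the property that may fail for dissimilarities derived from substitution matrices, and this is one reason why the discrete case requires a separate definition.

```latex
% For any point m and any distance d satisfying the triangle inequality:
\max\{d(a,m),\, d(m,b)\}
  \;\ge\; \tfrac{1}{2}\bigl(d(a,m) + d(m,b)\bigr)
  \;\ge\; \tfrac{1}{2}\, d(a,b).

% In Euclidean space, with d(x,y) = \lVert x - y \rVert and m^{*} = (a+b)/2:
d(a,m^{*}) \;=\; d(m^{*},b) \;=\; \tfrac{1}{2}\lVert a - b \rVert
  \;=\; \tfrac{1}{2}\, d(a,b),

% so m^{*} attains the lower bound. Equality in the first step requires
% d(a,m) = d(m,b), and equality in the second requires m to lie on the
% segment from a to b; together these force m = m^{*}.
```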

**3.1.2 Genome sequences**

For some time, I have been fascinated by the question of what “mathematical function” of a genome would change most slowly given a realistic model of genomic evolution. This led to an unexpected, fundamental discovery in the theory of *k*-mers, that there are simple relationships between the counts of *k*-mers which are exact on circular molecules. The fact that the relations are unavoidable (they are combinatorial in nature, and not related to any evolutionary, chemical or physical property of nucleic acids) means that they are best thought of as defining what is irrelevant to the goals of most genomic studies. This is a rare case in which one has definite knowledge regarding what is not relevant, and allows one to focus instead on what is. I intend to continue working on the implications of these relations, also in other contexts such as vision (for regular compound eyes and retinae with regular photoreceptor grids, such as is the case for many teleosts) where necessary relations are directly relevant for any discussion of the efficient coding hypothesis.
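
One elementary relation of this purely combinatorial kind can be checked numerically: on a circular sequence, the count of any *k*-mer equals the sum of the counts of its one-letter extensions to the right, and also to the left. The sketch below illustrates that single relation and makes no claim to reproduce the full set of relations in the published work; the function names and the test sequence are hypothetical, and `k + 1` is assumed not to exceed the sequence length.

```python
from collections import Counter

def circular_kmer_counts(seq, k):
    """Count k-mers on a circular sequence, wrapping around the end."""
    extended = seq + seq[: k - 1]
    return Counter(extended[i : i + k] for i in range(len(seq)))

def check_extension_relation(seq, k, alphabet="ACGT"):
    """Check N(w) == sum_x N(wx) == sum_x N(xw) for every observed k-mer w."""
    nk = circular_kmer_counts(seq, k)
    nk1 = circular_kmer_counts(seq, k + 1)
    for w, count in nk.items():
        right = sum(nk1[w + x] for x in alphabet)  # extensions to the right
        left = sum(nk1[x + w] for x in alphabet)   # extensions to the left
        if not (count == right == left):
            return False
    return True

genome = "ACGTTGCAAC"  # arbitrary made-up circular sequence
holds = check_extension_relation(genome, 2)
print(holds)  # True
```

The relation holds for any circular sequence because every occurrence of a *k*-mer has exactly one successor and one predecessor. On a linear molecule it fails at the two ends, where a *k*-mer may lack a neighbour.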

The fact that hidden structures can be found in *k*-mer analyses opens up a number of future directions. First, the existence of these relations demands a reconsideration of the statistics of *k*-mers, since the number of degrees of freedom is significantly lower than had been assumed. Next, there is the question of whether these relations can contribute to improved metagenome binning algorithms or genome size estimates. The answer appears to be in the affirmative.

**3.1.3 Current projects**

Sponges are theoretically fascinating for a number of reasons. Their apparent lack of any standard structure (in the sense of obvious symmetries or “architecture” which would be recognisable to a non-expert) motivates one to ask whether one can discern any less obvious structures or patterns. In loose collaboration with researchers at The University of Queensland, I am currently working on the structure of choanocyte chambers of sponges. In many cases, these are approximately spherical structures, with flagellated choanocytes on the surface, and a large open outlet for the water they pump to exit. Geometrically, the positions of the choanocytes would appear to be well approximated by the centres of circles packed onto a sphere, as in the Tammes problem. From a network point of view, Euler’s Law for polyhedra would appear to apply (the doubt arises from not knowing to what extent choanocytes have stable neighbours within their chamber). What is most interesting here is the fact that too much symmetry could interfere with pumping efficiency and/or the ability of the chamber to grow by recruitment, since gaps are useful for both. Circle packing on a sphere is however well known to produce asymmetric arrangements with plenty of gaps. I am currently exploring whether a mathematical model of choanocyte growth can be constructed from such considerations. This has already required use of OIST’s supercomputing facilities. Experimental data exists, and will be used to guide theoretical and computational work.
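
As a worked illustration of what Euler's formula would imply, suppose (as an idealisation, not an established property of choanocyte chambers) that the neighbour relations between the cells form a triangulation of a closed sphere with V vertices, E edges and F triangular faces. The open outlet of a real chamber would modify the count.

```latex
% Euler's formula for a polyhedron that is topologically a sphere:
V - E + F = 2.

% In a triangulation every face has three edges and every edge borders
% two faces, so 3F = 2E. Substituting F = 2E/3 gives
E = 3V - 6,
\qquad
\text{mean number of neighbours} \;=\; \frac{2E}{V} \;=\; 6 - \frac{12}{V}.

% Hence not every cell can have exactly six neighbours: summed over all
% cells, the deficit relative to six is always 12.
```

Under these assumptions a perfectly regular hexagonal arrangement is impossible on a sphere, which is consistent with the irregular arrangements produced by circle packing.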

I have also been working for some time on small-sample statistics, given the difficulty of obtaining large numbers of samples in some important areas of the life sciences. From my discussions with experimental biologists, it has become clear to me that currently available statistical approaches often fail to be truly appropriate. The approach I am taking is to begin with properties experimentalists would desire from small-sample statistics (such as estimates not changing abruptly given additional data). This requires new notions of continuity, for example. The ultimate goal would be a type of analysis which would be uniquely determined by such a list of desired properties. I have already made progress on this problem, but much work remains.

The response of organisms to stress is also a problem of theoretical interest, with many challenging aspects. For example, a central pattern generator may be robust with respect to temperature increases, but then suddenly fail above a critical temperature. Can one discern signs of stress in the dynamics of the generated patterns before the collapse? Much unrelated work on electrical power grids and the like suggests that there can be signs. In the case of a periodic signal (the pyloric rhythm of a crab, for example), there are reasons to believe that one sign of stress may be a “thickening” of the trajectory of the system in a suitable phase space. I am looking at this in collaboration with Prof. Martin Guest of Waseda University, to see whether these signs of collapse are related to modes of loss of integrability in certain dynamical systems. This is an example where the results are likely to be of interest to both biologists and mathematicians.

## 4. Publications

### 4.1 Journals

- Jun Inoue, Yukuto Sato, Robert Sinclair, Katsumi Tsukamoto and Mutsumi Nishida, Rapid genome reshaping by multiple-gene loss after whole-genome duplication in teleost fish suggested by mathematical modeling, PNAS, Volume 112, No. 48 (2015) 14918–14923. doi: 10.1073/pnas.1507669112.
- Robert Sinclair, Necessary Relations for Nucleotide Frequencies, Journal of Theoretical Biology, Volume 374 (June 2015) 179–182. doi: 10.1016/j.jtbi.2015.03.025.

### 4.2 Books and other one-time publications

Nothing to report

### 4.3 Oral and Poster Presentations

- Jun Inoue, *Origin and evolution of the freshwater eels* [Talk in Japanese], Eel Planet, Nihon University, Japan, December 19 (2015).
- Jun Inoue, *Identification of phylogenetic marker genes considering whole genome duplication and species tree*, 17th Annual Meeting of Society of Evolutionary Studies, Japan, August 20-23 (2015).
- Robert Sinclair, *This is not relevant*, NIG-OIST Joint Symposium: Evolutionary Bioinformatics, OIST, Okinawa, Japan, August 10-12 (2015).

## 5. Intellectual Property Rights and Other Specific Achievements

Nothing to report

## 6. Meetings and Events

Nothing to report

## 7. Other

Nothing to report