I am a probabilist interested mainly in random trees, discrete or continuous, the stochastic processes revolving around them, and their applications to evolutionary biology:
I currently work as a lecturer (maître de conférences) in the Probability and Statistics team in LmB, in Université de Franche-Comté at Besançon. You can find me in office 351B.
Before that, I carried out a doctoral thesis on new probabilistic models of random structured phylogenies, under the supervision of Amaury Lambert in LPSM (Sorbonne Université) and in the SMILE group hosted at Collège de France. The defence took place on December 2, 2019, and the manuscript can be found here.
Bureau 351B
Laboratoire de Mathématiques de Besançon
Université de Franche-Comté
16 route de Gray
25030 Besançon CEDEX
We define infinitesimal gradient boosting as a limit of the popular tree-based gradient boosting algorithm from machine learning. The limit is considered in the vanishing-learning-rate asymptotic, that is when the learning rate tends to zero and the number of gradient trees is rescaled accordingly. For this purpose, we introduce a new class of randomized regression trees bridging totally randomized trees and Extra Trees and using a softmax distribution for binary splitting. Our main result is the convergence of the associated stochastic algorithm and the characterization of the limiting procedure as the unique solution of a nonlinear ordinary differential equation in a infinite dimensional function space. Infinitesimal gradient boosting defines a smooth path in the space of continuous functions along which the training error decreases, the residuals remain centered and the total variation is well controlled.
We introduce a biologically natural, mathematically tractable model of random phylogenetic network to describe evolution in the presence of hybridization. One of the features of this model is that the hybridization rate of the lineages correlates negatively with their phylogenetic distance. We give formulas / characterizations for quantities of biological interest that make them straightforward to compute in practice. We show that the appropriately rescaled network, seen as a metric space, converges to the Brownian continuum random tree, and that the uniformly rooted network has a local weak limit, which we describe explicitly.
We study a class of individual-based, fixed-population size epidemic models under general assumptions, e.g., heterogeneous contact rates encapsulating changes in behavior and/or enforcement of control measures. We show that the large-population dynamics are deterministic and relate to the Kermack-McKendrick PDE. Our assumptions are minimalistic in the sense that the only important requirement is that the basic reproduction number of the epidemic ${R}_{0}$ be finite, and allow us to tackle both Markovian and non-Markovian dynamics. The novelty of our approach is to study the "infection graph" of the population. We show local convergence of this random graph to a Poisson (Galton-Watson) marked tree, recovering Markovian backward-in-time dynamics in the limit as we trace back the transmission chain leading to a focal infection. This effectively models the process of contact tracing in a large population. It is expressed in terms of the Doob h-transform of a certain renewal process encoding the time of infection along the chain. Our results provide a mathematical formulation relating a fundamental epidemiological quantity, the generation time distribution, to the successive time of infections along this transmission chain.
We present a unifying, tractable approach for studying the spread of viruses causing complex diseases that require to be modeled using a large number of types (e.g., infective stage, clinical state, risk factor class). We show that recording each infected individual's infection age, i.e., the time elapsed since infection, 1. The age distribution $n(t,a)$ of the population at time $t$ can be described by means of a first-order, one-dimensional partial differential equation (PDE) known as the McKendrick-von Foerster equation. 2. The frequency of type $i$ at time $t$ is simply obtained by integrating the probability $p(a,i)$ of being in state $i$ at age a against the age distribution $n(t,a)$. The advantage of this approach is three-fold. First, regardless of the number of types, macroscopic observables (e.g., incidence or prevalence of each type) only rely on a one-dimensional PDE "decorated" with types. This representation induces a simple methodology based on the McKendrick-von Foerster PDE with Poisson sampling to infer and forecast the epidemic. We illustrate this technique using a French data from the COVID-19 epidemic. Second, our approach generalizes and simplifies standard compartmental models using high-dimensional systems of ordinary differential equations (ODEs) to account for disease complexity. We show that such models can always be rewritten in our framework, thus, providing a low-dimensional yet equivalent representation of these complex models. Third, beyond the simplicity of the approach, we show that our population model naturally appears as a universal scaling limit of a large class of fully stochastic individual-based epidemic models, here the initial condition of the PDE emerges as the limiting age structure of an exponentially growing population starting from a single individual.
We consider fragmentation processes with values in the space of marked partitions of $\mathbb{N}$, i.e. partitions where each block is decorated with a nonnegative real number. Assuming that the marks on distinct blocks evolve as independent positive self-similar Markov processes and determine the speed at which their blocks fragment, we get a natural generalization of the self-similar fragmentations of Bertoin (2002). Our main result is the characterization of these generalized fragmentation processes: a Lévy-Khinchin representation is obtained, using techniques from positive self-similar Markov processes and from classical fragmentation processes. We then give sufficient conditions for their absorption in finite time to a frozen state, and for the genealogical tree of the process to have finite total length.
Starting from any graph on $\{1,\dots ,n\}$, consider the Markov chain where at each time-step a uniformly chosen vertex is disconnected from all of its neighbors and reconnected to another uniformly chosen vertex. This Markov chain has a stationary distribution whose support is the set of non-empty forests on $\{1,\dots ,n\}$. The random forest corresponding to this stationary distribution has interesting connections with the uniform rooted labeled tree and the uniform attachment tree. We fully characterize its degree distribution, the distribution of its number of trees, and the limit distribution of the size of a tree sampled uniformly. We also show that the size of the largest tree is asymptotically $\alpha \mathrm{log}n$, where $\alpha =(1-\mathrm{log}(e-1){)}^{-1}\approx 2.18$, and that the degree of the most connected vertex is asymptotically $\mathrm{log}n/\mathrm{log}\mathrm{log}n$.
Similarly as in (Electron. J. Probab. 23 (2018)) where nested coalescent processes are studied, we generalize the definition of partition-valued homogeneous Markov fragmentation processes to the setting of nested partitions, i.e. pairs of partitions $(\zeta ,\xi )$ where $\zeta $ is finer than $\xi $. As in the classical univariate setting, under exchangeability and branching assumptions, we characterize the jump measure of nested fragmentation processes, in terms of erosion coefficients and dislocation measures. Among the possible jumps of a nested fragmentation, three forms of erosion and two forms of dislocation are identified – one being specific to the nested setting and relating to a bivariate paintbox process.
Poursuivant l’idée de (Electron. J. Probab. 23 (2018)) où les processus de coalescence emboîtés sont étudiés, nous étendons ici la définition des processus de fragmentation markoviens homogènes aux processus de fragmentation à valeurs dans les partitions emboîtées, c’est-à-dire les paires de partitions $(\zeta ,\xi )$ telles que $\zeta $ soit plus fine que $\xi $. Comme dans le contexte classique (dit univarié), sous des hypothèses d’échangeabilité et de branchement, nous caractérisons la mesure de saut des processus de fragmentation emboîtés en termes de coefficients d’érosion et de mesures de dislocation. Les sauts d’une fragmentation emboîtée peuvent être de plusieurs natures différentes : nous distinguons trois formes d’érosions et deux formes de dislocations, l’une d’elles étant spécifique au contexte des partitions emboîtées et étant générée par un processus de pots de peinture bivarié.
For the random interval partition of $[0,1]$ generated by the uniform stick-breaking scheme known as GEM$(1)$, let ${u}_{k}$ be the probability that the first $k$ intervals created by the stick-breaking scheme are also the first $k$ intervals to be discovered in a process of uniform random sampling of points from $[0,1]$. Then ${u}_{k}$ is a renewal sequence. We prove that ${u}_{k}$ is a rational linear combination of the real numbers $1,\zeta (2)\u202f,\dots ,\zeta (k)$ where $\zeta $ is the Riemann zeta function, and show that ${u}_{k}$ has the limit $1/3$ as $k\to \u202f\infty $. Related results provide probabilistic interpretations of some multiple zeta values in terms of a Markov chain derived from the interval partition. This Markov chain has the structure of a weak record chain. Similar results are given for the GEM$(\theta )$ model, with beta$(1,\theta )$ instead of uniform stick-breaking factors, and for another more algebraic derivation of renewal sequences from the Riemann zeta function.
We consider the compact space of pairs of nested partitions of $\mathbb{N}$, where by analogy with models used in molecular evolution, we call “gene partition” the finer partition and “species partition” the coarser one. We introduce the class of nondecreasing processes valued in nested partitions, assumed Markovian and with exchangeable semigroup. These processes are said simple when each partition only undergoes one coalescence event at a time (but possibly the same time). Simple nested exchangeable coalescent (SNEC) processes can be seen as the extension of $\Lambda $-coalescents to nested partitions. We characterize the law of SNEC processes as follows. In the absence of gene coalescences, species blocks undergo $\Lambda $-coalescent type events and in the absence of species coalescences, gene blocks lying in the same species block undergo i.i.d. $\Lambda $-coalescents. Simultaneous coalescence of the gene and species partitions are governed by an intensity measure ${\nu}_{s}$ on $(0,1]\times {\mathcal{M}}_{1}([0,1])$ providing the frequency of species merging and the law in which are drawn (independently) the frequencies of genes merging in each coalescing species block. As an application, we also study the conditions under which a SNEC process comes down from infinity.
Consider a random real tree whose leaf set, or boundary, is endowed with a finite mass measure. Each element of the tree is further given a type, or allele, inherited from the most recent atom of a random point measure (infinitely-many-allele model) on the skeleton of the tree. The partition of the boundary into distinct alleles is the so-called allelic partition. In this paper, we are interested in the infinite trees generated by supercritical, possibly time-inhomogeneous, binary branching processes, and in their boundary, which is the set of particles “coexisting at infinity”. We prove that any such tree can be mapped to a random, compact ultrametric tree called the coalescent point process, endowed with a “uniform” measure on its boundary which is the limit as $t\to \u202f\infty $ of the properly rescaled counting measure of the population at time $t$. We prove that the clonal (i.e., carrying the same allele as the root) part of the boundary is a regenerative set that we characterize. We then study the allelic partition of the boundary through the measures of its blocks. We also study the dynamics of the clonal subtree, which is a Markovian increasing tree process as mutations are removed.
In the last two decades, lineage-based models of diversification, where species are viewed as particles that can divide (speciate) or die (become extinct) at rates depending on some evolving trait, have been very popular tools to study macroevolutionary processes. Here, we argue that this approach cannot be used to break down the inner workings of species diversification and that ``opening the species box'' is necessary to understand the causes of macroevolution. We set up a general framework for individual-based models of neutral speciation (i.e. no selection forces other than those acting against hybrids) that rely on a minimal number of mechanistic principles: (i) reproductive isolation is caused by excessive dissimilarity between pheno/genotypes; (ii) dissimilarity results from a balance between differentiation processes and homogenization processes; and (iii) dissimilarity can feed back on these processes by decelerating homogenization. We classify such models according to the main process responsible for homogenization: (1) clonal evolution models (ecological drift), (2) models of genetic isolation (gene flow) and (3) models of isolation by distance (spatial drift). We review these models and their specific predictions on macroscopic variables such as species abundances, speciation rates, interfertility relationships, phylogenetic tree structure\dots We propose new avenues of research by displaying conceptual questions remaining to be solved and new models to address them: the failure of speciation at secondary contact, the feedback of dissimilarity on homogenization, the emergence in space of reproductive barriers.
Infinitesimal gradient boosting is defined as the vanishing-learning-rate limit of the popular tree-based gradient boosting algorithm from machine learning (Dombry and Duchamps, 2021). It is characterized as the solution of a nonlinear ordinary differential equation in a infinite-dimensional function space where the infinitesimal boosting operator driving the dynamics depends on the training sample. We consider the asymptotic behavior of the model in the large sample limit and prove its convergence to a deterministic process. This infinite population limit is again characterized by a differential equation that depends on the population distribution. We explore some properties of this population limit: we prove that the dynamics makes the test error decrease and we consider its long time behavior.