Online Mini-symposium on Probabilistic and Topological Methods for Biological Data

as part of the SIAM Conference on Mathematics of Data Science (MDS20)

Speakers in the topology session

  • Francis C. Motta, Florida Atlantic University, USA.
  • Title: Evidence of an Intrinsic Clock and Host Parasite Coupling in the Intraerythrocytic Developmental Cycle of Plasmodium.

    Abstract

    Living things did not evolve and do not exist as autonomous elements, but instead represent the nodes of a complex network of interacting dynamical systems, coupled at multiple scales to each other and their environment. For instance, many biological organisms across kingdoms exhibit intrinsic, free-running circadian rhythms that are controlled by gene regulatory networks and are coupled to the rhythmic forcing of the day-night cycle. Although there is evidence that Plasmodium, the causative agent of malaria infection, leverages the intrinsic periodicity of the host for its own purposes, it remains unknown if these protozoans possess a regulatory network controlling the precise timing of their intraerythrocytic developmental cycles (IDC). In this talk we model the variability of IDC progression rates across a population of parasites, and compare it to the intrinsic period variability seen in free-running circadian oscillators, to show that the IDC exhibits properties that are consistent with those of well-tuned biological clocks. We also discuss a data-driven analysis of a singular ex vivo experiment that captures the transcript dynamics of P. vivax together with the dynamic gene expression of 9 human hosts; the analysis suggests host-parasite coupling at the level of their transcriptional programs.

  • Marilyn Vazquez, The Ohio State University, USA.
  • Title: A Consistent Density-Based Clustering Algorithm and its Application to Image Segmentation.

    Abstract

    Data clustering is a fundamental task for discovering patterns in data, and is central to machine learning. Often, a data set is assumed to live in a manifold and be sampled according to a probability measure. Then the clusters can be defined as peaks in the sampled probability density, and a clustering algorithm would need to identify the peaks in the density to compute the clusters. Some of the challenges in this approach include the non-uniform sampling of the density and the bridges between peaks of the density. To solve these problems, we propose a new clustering algorithm that divides the clustering problem into three steps: picking a good threshold on the sample density to separate the peaks, clustering the superlevel set at the chosen threshold, and classifying the remaining points. We explain the key details of these steps, and provide theoretical assurances on the performance. As an important application, we show how to apply this method to segment images by considering the images as a point-cloud of image patches. We present results on images of various biological systems.
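
    The three steps named in the abstract (threshold the density, cluster the superlevel set, classify the remaining points) can be illustrated with a minimal Python sketch. This is not the speaker's algorithm: the k-nearest-neighbour density proxy, the fixed-quantile threshold (keep_fraction), the radius-graph connected components, and the nearest-neighbour assignment are all assumptions chosen for brevity, and every function name is hypothetical.

```python
import math


def knn_density(points, k):
    """Crude density proxy (assumption): inverse distance to the k-th nearest neighbour."""
    densities = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)
        densities.append(1.0 / (dists[k] + 1e-12))  # dists[0] is the point itself
    return densities


def connected_components(points, indices, radius):
    """Label the given indices by connected components of the radius-graph."""
    labels = {}
    current = 0
    for start in indices:
        if start in labels:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in indices:
                if j not in labels and math.dist(points[i], points[j]) <= radius:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def density_cluster(points, k=3, keep_fraction=0.75, radius=0.5):
    densities = knn_density(points, k)
    # Step 1: pick a threshold. Assumption: keep a fixed fraction of the
    # highest-density points instead of choosing the threshold adaptively.
    order = sorted(range(len(points)), key=lambda i: densities[i], reverse=True)
    n_keep = max(1, int(round(keep_fraction * len(points))))
    core = order[:n_keep]
    # Step 2: cluster the superlevel set (the kept points).
    labels = connected_components(points, core, radius)
    # Step 3: classify each remaining point by its nearest superlevel-set point.
    for i in range(len(points)):
        if i not in labels:
            nearest = min(core, key=lambda j: math.dist(points[i], points[j]))
            labels[i] = labels[nearest]
    return [labels[i] for i in range(len(points))]
```

    On two tight blobs joined by a sparse bridge of points, the bridge falls below the threshold in step 1, so step 2 sees two separate components and step 3 splits the bridge between them.
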
  • Maria-Veronica Ciocanel, The Ohio State University, USA.
  • Title: Topological Data Analysis for Ring Channels in Cell Biology.

    Abstract

    Contractile rings are cellular structures made of actin filaments and are important in development, wound healing, and cell division. In the worm model organisms, ring channels allow nutrient exchange to the developing egg cells and are regulated by forces exerted by myosin motor proteins. I will present an agent-based modeling and data analysis framework for the interactions between actin filaments and myosin motor proteins. This approach provides key insights for the mechanistic differences between two motors that are believed to maintain the rings at a constant diameter. In particular, we propose tools from topological data analysis to understand time-series data of filamentous network interactions. Our proposed methods clearly reveal the impact of certain parameters on significant topological circle formation, thus giving insight into these biological ring channels.

  • Manuchehr Aminian, Colorado State University, USA.
  • Title: Identifying generators of topological features in real data.

    Abstract

    When we work with a synthetic data set in persistent homology, such as the classic "noisy circle", our intuition allows us to tie a computed topological feature (for instance, a birth/death pair) back to the expected "loop" structure of the data. However, we cannot apply such intuition as easily if we do not "know the answer" in advance. This is a problem if we need concrete answers to questions such as "why is this birth/death pair present?" and, more specifically, "what subset of the data points can well-represent this pair?"

    I will give my perspective as an applied mathematician on this problem and present my preliminary work building an interface in Python to directly associate computed topological features with their generators, with applications to two separate projects: studying protein structure, and drawing knowledge about human immune response to influenza-like illnesses in patients in the first few days after exposure.
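
    As a small illustration of tying a birth/death pair back to a subset of data points, the sketch below computes 0-dimensional persistent homology (connected components) of a Vietoris-Rips filtration with a union-find structure and records which points belong to the component that dies at each merge. It is not the speaker's interface and does not handle loops (dimension 1); the function name and the tie-breaking convention (the smaller component dies, since all components are born at scale 0) are assumptions.

```python
import math
from itertools import combinations


def h0_persistence_with_members(points):
    """Return (birth, death, member_indices) triples for connected components.

    Components are born at scale 0. When two components merge at an edge of
    length d, the smaller one (assumed convention) dies at d, and its point
    indices are recorded as the subset of the data representing that pair.
    """
    n = len(points)
    parent = list(range(n))
    members = {i: [i] for i in range(n)}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process edges in order of increasing length (the filtration order).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    pairs = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri == rj:
            continue  # edge closes a cycle; no component dies
        if len(members[ri]) < len(members[rj]):
            ri, rj = rj, ri  # make rj the smaller component
        pairs.append((0.0, length, sorted(members[rj])))
        parent[rj] = ri
        members[ri].extend(members.pop(rj))
    # One component never dies.
    survivor = find(0)
    pairs.append((0.0, math.inf, sorted(members[survivor])))
    return pairs
```

    For a point cloud with two well-separated groups, the most persistent finite pair is the one whose recorded members are exactly the smaller group, which answers "which points represent this pair?" in the simplest possible setting.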

Speakers in the probability session

  • Wasiur R. KhudaBukhsh, The Ohio State University, USA.
  • Title: Dynamic Survival Analysis of Epidemics and How COVID-19 shaped it.

    Abstract

    This talk will introduce the notion of dynamic survival analysis (DSA). We show that solutions to ordinary differential equations (ODEs) describing the large-population limits of Markovian stochastic epidemic models can be interpreted as survival or cumulative hazard functions when analysing data on individuals sampled from the population. We refer to the individual-level survival and hazard functions derived from population-level equations as a survival dynamical system (SDS). To illustrate how population-level dynamics imply probability laws for individual-level infection and recovery times that can be used for statistical inference, we show numerical examples based on synthetic data as well as the COVID-19 outbreak data.

    The second part of the talk will focus on developing the DSA methodology for non-Markovian dynamics. Measure-valued processes play a key role in this endeavour. For the non-Markovian set-up, the DSA-likelihood is shown to depend on solutions to partial differential equations instead of ODEs as before. Preliminary numerical results for parameter inference will be shown. Finally, extension to non-Markovian epidemics on configuration model random graphs will be discussed.
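
    For the Markovian case in the first part of the abstract, the idea of reading an ODE solution as a survival function can be sketched numerically. The abstract does not name a specific model, so the sketch assumes the standard SIR equations ds/dt = -beta*s*i, di/dt = beta*s*i - gamma*i with s(0) = 1 and i(0) = rho. Under that assumption, s(t) is read as the probability that an initially susceptible individual is still uninfected at time t, -log s(t) as the cumulative hazard, and beta*s(t)*i(t)/(1 - s(T)) as the density of the infection time conditional on infection by time T. All function and parameter names are hypothetical.

```python
import math


def sir_survival(beta, gamma, rho, t_end, dt=0.01):
    """Integrate the assumed SIR system with classical RK4.

    Returns (times, s, i), where s is read as a survival function.
    """
    def rhs(s, i):
        return -beta * s * i, beta * s * i - gamma * i

    steps = int(round(t_end / dt))
    s, i = 1.0, rho
    times, sus, inf = [0.0], [s], [i]
    for n in range(steps):
        k1 = rhs(s, i)
        k2 = rhs(s + 0.5 * dt * k1[0], i + 0.5 * dt * k1[1])
        k3 = rhs(s + 0.5 * dt * k2[0], i + 0.5 * dt * k2[1])
        k4 = rhs(s + dt * k3[0], i + dt * k3[1])
        s += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
        i += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
        times.append((n + 1) * dt)
        sus.append(s)
        inf.append(i)
    return times, sus, inf


def infection_time_density(beta, sus, inf):
    """Density of the infection time, conditional on infection by the final time."""
    tau = 1.0 - sus[-1]  # probability of being infected by the final time
    return [beta * s * i / tau for s, i in zip(sus, inf)]


def trapezoid(values, dt):
    """Trapezoidal rule on an evenly spaced grid."""
    return dt * (sum(values) - 0.5 * (values[0] + values[-1]))
```

    With such a density in hand, a likelihood for observed infection times can be written as a product of density values, which is the sense in which population-level equations yield individual-level probability laws.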

  • Arindam Fadikar, Argonne National Laboratory, USA; Robert Gramacy and David M. Higdon, Virginia Tech, USA.
  • Title: Clustering based Gaussian Process Emulation and Calibration of a Stochastic Agent based Model.

    Abstract

    The Gaussian process (GP) model is an effective tool for emulating complex computer simulations. The heterogeneous Gaussian process (Binois et al., 2017) has been shown to be superior in the presence of input-dependent noise, as in the case of any stochastic computer simulation. However, all GP models impose a Gaussian variability assumption in the emulator. In this talk, we propose a new approach based on the heterogeneous GP and a clustering-based technique to emulate and hence calibrate a stochastic agent-based simulation. The basic idea is to relax the normality assumption by borrowing the standard Gaussian mixture model and coupling that with a traditional GP. The study is motivated by an example taken from the 2015 Ebola challenge workshop, which simulated an Ebola epidemic to evaluate methodology.

  • Pragya Sur, Harvard University, USA.
  • Title: Modern Likelihood based Approaches for High-Dimensional Non-Linear Models.

    Abstract

    Generalized linear models are a class of widely used non-linear models in statistics. Classical maximum-likelihood theory based statistical inference is ubiquitous in this context. This theory hinges on well-known fundamental results: (1) the maximum-likelihood estimate (MLE) is asymptotically unbiased and normally distributed, (2) its variability can be quantified via the inverse Fisher information, and (3) the likelihood-ratio test (LRT) statistic is asymptotically chi-squared distributed. In this talk, I will consider the specific setting of logistic regression models and show that in the common modern setting where the number of features and the sample size are both large and comparable, classical results are far from accurate. In fact, (1) the MLE is biased, (2) its variability is far greater than classical results predict, and (3) the LRT statistic is not chi-squared distributed. Consequently, p-values and confidence intervals based on classical theory are completely invalid. I will describe a new theory that precisely characterizes the asymptotic behavior of the MLE and the LRT under certain assumptions on the covariate distribution; this in turn yields valid p-values and confidence intervals in such high-dimensional settings. If time permits, I will discuss general techniques that may enable the study of likelihood-based estimators in other non-linear models under similar high-dimensional asymptotics. This is based on joint work with Emmanuel Candes, Yuxin Chen and Qian Zhao.

  • Yuekai Sun, University of Michigan, USA.
  • Title: Communication-Efficient Integrative Regression in High-Dimensions.

    Abstract

    We consider the task of meta-analysis in high-dimensional settings in which the data sources we wish to integrate are similar, but nonidentical. To borrow strength across such heterogeneous data sources, we introduce a global parameter that remains sparse even in the presence of outlier data sources. We also propose a one-shot estimator of the global parameter that preserves the anonymity of the data sources and converges at a rate that depends on the size of the combined dataset. Finally, we demonstrate the benefits of our approach on a large-scale drug treatment dataset.