Research
My current research focuses on developing flexible and scalable functional data analysis methods for modeling modern health data collected via wearable and implantable technologies (WIT), which allow for high-resolution, around-the-clock monitoring of critical signals from the human body. The complexity and scale of the data generated by these technologies challenge existing statistical methods, require novel analytic tools, and motivate my research.
High-resolution hemodynamic data
I am collaborating with anesthesiologists from the Johns Hopkins Medical Institute to decode high-resolution hemodynamic data collected from thousands of cardiac surgeries. Examples of hemodynamic measures include mean arterial pressure (MAP), central venous pressure (CVP), and cardiac index (CI).
Proper organ function depends on an adequate supply of oxygen-rich blood, which, in turn, is influenced by hemodynamic factors. Previous studies have identified a strong association between hemodynamic values and the risk of cardiac surgery-associated acute kidney injury (CSA-AKI), but it remains unknown how organ damage accumulates.
My work in this area focuses on developing nonparametric methods that can handle the complexity of repeated hemodynamic zone exposure with the goal of flexibly modeling organ damage accumulation.
Generalized multilevel functional data
The figure below shows my glucose data collected over six days using Dexcom’s Stelo device. I wanted to know when my glucose levels fell outside the recommended range for healthy individuals (70-140 mg/dL, indicated by the green shaded regions). This information can be represented as multilevel (one level per day) binary functional data indicating whether glucose values are within or outside the target range. Other examples of binary functional data include hemodynamic data (inside or outside risky hemodynamic zones) and physical activity data collected via accelerometers (active vs. inactive periods).
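As a minimal sketch of this representation (in Julia, with simulated readings standing in for the real Stelo data; the array shapes and the 5-minute sampling grid are my illustrative assumptions), the conversion from raw glucose curves to day-level binary functional data is simply an indicator of being inside the 70-140 mg/dL range:

```julia
# Minimal sketch: turn raw glucose curves into multilevel binary functional data.
# The values below are simulated placeholders, not actual Stelo readings, and the
# grid of 288 readings per day (one every 5 minutes) is an assumption.
using Random

Random.seed!(1)
n_days, n_times = 6, 288
glucose = 100 .+ 25 .* randn(n_days, n_times)   # rows = days, columns = time points

lo, hi = 70.0, 140.0                            # recommended range in mg/dL
in_range = (glucose .>= lo) .& (glucose .<= hi) # day-by-time binary matrix

# Each row of `in_range` is one day's binary curve (1 = within range, 0 = outside);
# stacking the rows gives the multilevel binary functional data described above.
```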
Despite the abundance of generalized (binary, count, …) multilevel functional data, like those mentioned, current inferential methods are not well-suited to handle the size, complexity, and structure of such datasets.
My work in this area has led to generalized multilevel functional principal component analysis (GM-FPCA), a novel and scalable method for extracting dominant modes of variation from such data, enabling effective inference and predictive modeling.
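As a rough schematic of the kind of model involved (illustrative notation only; see the related publication for the exact formulation), the observed binary curves are treated as exponential-family responses driven by a latent multilevel Gaussian process expanded over principal components:

\[Y_{ij}(t) \mid \eta_{ij}(t) \sim \mathrm{Bernoulli}\left\{ g^{-1}\big(\eta_{ij}(t)\big) \right\}, \qquad \eta_{ij}(t) = \mu(t) + \sum_{k=1}^{K} \xi_{ik}\,\phi_k(t) + \sum_{l=1}^{L} \zeta_{ijl}\,\psi_l(t),\]

where \(Y_{ij}(t)\) is, for example, the in-range indicator for subject \(i\) on day \(j\), \(g\) is a link function such as the logit, \(\mu(t)\) is the population mean on the latent scale, and \(\phi_k\) and \(\psi_l\) are subject-level and day-within-subject principal components with scores \(\xi_{ik}\) and \(\zeta_{ijl}\). Count-valued data swap the Bernoulli for another exponential-family distribution.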
Related publication
Functional random effect inference
Given continuous monitoring data from a subject over the past 10 days, can we determine with high confidence, and with uncertainty quantification, whether the observations from day 11 appear normal or anomalous? This problem has broad applications, from medicine to engineering, but addressing it requires reliable inference on the functional random effect — a topic that has been underexplored.
My work in this area focuses on developing scalable methods to enable functional random effect inference not only for Gaussian functional data but also for other types of generalized functional data.
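To make the day-11 question concrete, a stylized version of the setup (my illustrative notation, not necessarily the exact model in this work) writes the monitoring data on day \(j\) as

\[Y_j(t) = \mu(t) + U_j(t) + \varepsilon_j(t), \qquad j = 1, \dots, 11,\]

where \(\mu(t)\) is the subject's typical daily profile, \(U_j(t)\) is a day-specific functional random effect, and \(\varepsilon_j(t)\) is measurement noise. Flagging day 11 as anomalous then amounts to asking whether the predicted \(U_{11}(t)\), together with a simultaneous band quantifying its uncertainty, is consistent with the variability seen in \(U_1(t), \dots, U_{10}(t)\).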
Prior work
Inference for massive and distributed longitudinal data
Longitudinal data are ubiquitous in medical research, and linear mixed models (LMMs) are powerful tools for analyzing them. Statistical inference for variance component parameters has traditionally relied on the bootstrap, which becomes prohibitively slow on large datasets.
My work in this area has led to the Bag of Little Bootstrap method for massive longitudinal data, which reduces the computational cost from \(O(BNq^3)\) to \(O(Bbq^3)\), where \(B\) is the number of bootstrap replicates, \(N\) is the number of subjects, \(q\) is the number of random effect parameters, and \(b \ll N\). It achieves a 200-fold speedup at the scale of 1 million subjects (20 million total observations) and is currently the only available tool that can handle more than 10 million subjects (200 million total observations) on a desktop computer.
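The resampling pattern behind the Bag of Little Bootstrap is easy to sketch: work with a small subsample of \(b\) subjects, then reuse it under multinomial weights that emulate resampling all \(N\) subjects. The toy Julia code below illustrates only this pattern on a simple subject-level statistic (a mean); the variable names, the subsample-size rule \(b = N^{0.7}\), and the use of Distributions.jl are my choices for illustration, not the MixedModelsBLB.jl implementation, which refits an LMM per weighted subsample.

```julia
# Toy illustration of the Bag of Little Bootstrap (BLB) resampling pattern.
# The "estimator" here is just a weighted mean of subject-level summaries,
# to keep the sketch self-contained; real use refits a linear mixed model.
using Random, Statistics, Distributions

Random.seed!(2024)
N = 10_000                              # subjects (toy size)
subject_summary = randn(N)              # placeholder subject-level statistics

n_subsets, B = 10, 200                  # little subsamples and bootstrap reps per subsample
b = round(Int, N^0.7)                   # subsample size, b << N

ci_per_subset = map(1:n_subsets) do _
    idx = randperm(N)[1:b]              # draw a subsample of b subjects
    x = subject_summary[idx]
    reps = map(1:B) do _
        # multinomial weights summing to N emulate resampling all N subjects
        w = rand(Multinomial(N, fill(1 / b, b)))
        sum(w .* x) / N                 # weighted estimate computed on only b subjects
    end
    quantile(reps, [0.025, 0.975])      # 95% interval from this subsample
end

blb_ci = mean(ci_per_subset)            # average interval endpoints across subsamples
```

Because every bootstrap replicate touches only the \(b\) sampled subjects, the per-replicate cost scales with \(b\) rather than \(N\), which is the source of the \(O(BNq^3)\) to \(O(Bbq^3)\) reduction noted above.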
Related publication
Software
- MixedModelsBLB.jl: A Julia package for analyzing massive longitudinal data using linear mixed models (LMMs) through the Bag of Little Bootstrap (BLB) method.
  - Supports a variety of data interfaces, including direct interfacing with databases.
  - Supports parallel processing.
  - Supports both gradient-based and gradient-free solvers.
Inference for constrained and regularized estimation problems
Many statistical learning tasks, including the Lasso, constrained Lasso, graphical Lasso, matrix completion, and sparse low-rank matrix regression, can be posed as regularized and/or constrained maximum likelihood estimation, which requires solving optimization problems of the form
\[\text{maximize} \quad \ell(\boldsymbol{\theta}) - \rho P(\boldsymbol{\theta}),\]where \(\boldsymbol{\theta}\) denotes the model parameters, \(\ell(\boldsymbol{\theta})\) is the log-likelihood and quantifies the lack of fit between the model and the data, \(P(\boldsymbol{\theta})\) is a regularization function that imposes structure on the parameter estimates, and \(\rho\) is a nonnegative regularization strength parameter that trades off the model fit encoded in \(\ell(\boldsymbol{\theta})\) against the desired structure encoded in \(P(\boldsymbol{\theta})\). For many problems of this form, statistical inference for \(\boldsymbol{\theta}\) either does not exist or requires substantial problem-specific analysis, such as developing new priors or deriving analytic results. For example, extending post-selection inference results for the Lasso to the setting where additional constraints are incorporated (the constrained Lasso) is not trivial.
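For concreteness, the Lasso is one instance of this template, stated here in its standard penalized Gaussian likelihood form (the linear constraint set shown for the constrained Lasso is one common choice, written in my notation):

\[\ell(\boldsymbol{\theta}) = -\tfrac{1}{2}\,\lVert \mathbf{y} - \mathbf{X}\boldsymbol{\theta} \rVert_2^2, \qquad P(\boldsymbol{\theta}) = \lVert \boldsymbol{\theta} \rVert_1,\]

so that maximizing \(\ell(\boldsymbol{\theta}) - \rho P(\boldsymbol{\theta})\) is exactly \(\ell_1\)-penalized least squares, while the constrained Lasso adds a restriction such as \(\mathbf{A}\boldsymbol{\theta} \le \mathbf{b}\), the setting where extending post-selection inference becomes nontrivial, as noted above.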
My work in this area aims to make statistical inference for problems of this type much easier and more automatic. It addresses these challenges by integrating ideas from the optimization literature, such as Moreau-Yosida envelopes and proximal maps, with the powerful machinery of Bayesian inference. This approach has led to ProxMCMC, a flexible, general, and fully Bayesian inference framework for constrained and regularized estimation.
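As a small, self-contained illustration of the optimization ingredients mentioned above, and not the ProxMCMC implementation itself, the Julia sketch below computes the proximal map of \(\lambda \lVert \boldsymbol{\theta} \rVert_1\) (coordinate-wise soft-thresholding) and its Moreau-Yosida envelope; the function names and test values are made up for the example.

```julia
# Sketch of two building blocks used in proximal-map-based constructions:
# the proximal map and Moreau-Yosida envelope of P(θ) = λ‖θ‖₁.
# Illustrative only; the actual ProxMCMC framework is described in the paper.

# prox of the scaled absolute value: coordinate-wise soft-thresholding.
soft_threshold(x::Real, t::Real) = sign(x) * max(abs(x) - t, 0.0)

prox_l1(x::AbstractVector, t::Real) = soft_threshold.(x, t)

# Moreau-Yosida envelope of λ‖·‖₁ with smoothing parameter γ:
#   env(x) = min_z { λ‖z‖₁ + ‖x − z‖² / (2γ) },
# whose minimizer is exactly prox_l1(x, λγ).
function moreau_envelope_l1(x::AbstractVector, λ::Real, γ::Real)
    z = prox_l1(x, λ * γ)
    return λ * sum(abs, z) + sum(abs2, x .- z) / (2γ)
end

x = [-2.0, -0.3, 0.0, 0.8, 3.5]
prox_l1(x, 1.0)                 # entries within [-1, 1] are shrunk to zero
moreau_envelope_l1(x, 1.0, 0.5) # smooth surrogate value, finite everywhere
```

The envelope is finite and differentiable everywhere, which is, roughly, what makes gradient-based Bayesian samplers applicable to otherwise nonsmooth or constrained targets.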
Related publication
Software