Research

My current research focuses on developing flexible and scalable functional data analysis methods for modeling modern health data collected via wearable and implantable technologies (WIT), which allow for high-resolution, around-the-clock monitoring of critical signals from the human body. The complexity and scale of the data generated by these technologies challenge existing statistical methods, require novel analytic tools, and motivate my research.

High-resolution hemodynamic data

I am collaborating with anesthesiologists from the Johns Hopkins Medical Institute to decode high-resolution hemodynamic data collected from thousands of cardiac surgeries. Examples of hemodynamic data include mean arterial pressure (MAP), central venous pressure (CVP), cardiac index (CI), and many others.

Proper organ function depends on an adequate supply of oxygen-rich blood, which, in turn, is influenced by hemodynamic factors. Previous studies have identified a strong association between hemodynamic values and the risk of cardiac surgery-associated acute kidney injury (CSA-AKI), but it remains unknown how organ damage accumulates.

My work in this area focuses on developing nonparametric methods that can handle the complexity of repeated hemodynamic zone exposure with the goal of flexibly modeling organ damage accumulation.


Generalized multilevel functional data

The figure below shows my glucose data collected over six days using Dexcom’s Stelo device. I wanted to know when my glucose levels fell outside the recommended range for healthy individuals (70-140 mg/dL, indicated by the green shaded regions). This information can be represented as multilevel (for the multiple days) binary functional data, indicating whether glucose values are within or outside the target range. Other examples of binary functional data include hemodynamic data (inside or outside risky hemodynamic zones) and physical activity data collected via accelerometers (active vs. inactive periods).
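As a concrete illustration of this representation, the sketch below converts raw glucose readings into day-by-time binary curves. The data frame and its column names are hypothetical stand-ins for an exported stream of Stelo readings; only the 70-140 mg/dL range comes from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous glucose export: one row per 5-minute reading,
# with a timestamp and a glucose value in mg/dL (simulated here).
readings = pd.DataFrame({
    "time": pd.date_range("2024-06-01", periods=6 * 288, freq="5min"),
    "glucose": np.random.default_rng(1).normal(110, 25, size=6 * 288),
})

# Multilevel binary functional data: rows index days (the levels), columns
# index within-day time points, and each entry indicates whether the
# reading falls outside the 70-140 mg/dL target range.
readings["day"] = readings["time"].dt.date
readings["minute"] = readings["time"].dt.hour * 60 + readings["time"].dt.minute
readings["out_of_range"] = (~readings["glucose"].between(70, 140)).astype(int)

binary_curves = readings.pivot(index="day", columns="minute", values="out_of_range")
print(binary_curves.shape)  # (6 days, 288 five-minute grid points)
```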

Despite the abundance of generalized (binary, count, …) multilevel functional data, like those mentioned, current inferential methods are not well-suited to handle the size, complexity, and structure of such datasets.

My work in this area has led to the generalized multilevel functional principal component analysis (GM-FPCA), a novel and scalable method for extracting dominant modes of variation from such data, enabling effective inference and predictive modeling.
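To fix notation, one standard way to write down such a model for binary multilevel functional data (a generic latent-process formulation, not necessarily the exact specification used in GM-FPCA) is

\[Y_{ij}(t) \mid \eta_{ij}(t) \sim \mathrm{Bernoulli}\{g^{-1}(\eta_{ij}(t))\}, \qquad \eta_{ij}(t) = \beta_0(t) + \sum_{k=1}^{K} \xi_{ik}\, \phi_k(t) + \sum_{l=1}^{L} \zeta_{ijl}\, \psi_l(t),\]

where \(Y_{ij}(t)\) is the binary observation for subject \(i\) on day \(j\) at time \(t\), \(g\) is a link function (e.g., the logit), \(\beta_0(t)\) is the population mean on the linear predictor scale, \(\phi_k\) and \(\psi_l\) are the subject-level and day-within-subject-level eigenfunctions, and \(\xi_{ik}\) and \(\zeta_{ijl}\) are the corresponding principal component scores. The dominant modes of variation correspond to the leading eigenfunctions at each level.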

Related publication

My continuous glucose data collected using Dexcom's Stelo.

Functional random effect inference

Given continuous monitoring data from a subject over the past 10 days, can we determine with high confidence, and with uncertainty quantification, whether the observations from day 11 appear normal or anomalous? This problem has broad applications, from medicine to engineering, but addressing it requires reliable inference on the functional random effect — a topic that has been underexplored.

My work in this area focuses on developing scalable methods to enable functional random effect inference not only for Gaussian functional data but also for other types of generalized functional data.

Physical activity data from two participants (left and right) in the NHANES accelerometry study. The y-axis is log-transformed physical activity intensity measured in the Monitor-Independent Movement Summary (MIMS) unit, and the x-axis is time from midnight to midnight.





Prior work


Inference for massive and distributed longitudinal data

Longitudinal data are ubiquitous in medical research, and linear mixed models (LMMs) are powerful tools for analyzing them. Statistical inference for variance component parameters has traditionally relied on the bootstrap, which becomes prohibitively slow on large datasets.

My work in this area has led to the Bag of Little Bootstrap method for massive longitudinal data, which reduces the computational cost from \(O(BNq^3)\) to \(O(Bbq^3)\), where \(B\) is the number of bootstrap replicates, \(N\) is the number of subjects, \(q\) is the number of random effect parameters, and \(b \ll N\). It achieves a more than 200-fold speedup at the scale of 1 million subjects (20 million total observations) and is the only currently available tool that can handle more than 10 million subjects (200 million total observations) on a desktop computer.
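The skeleton below sketches the subject-level bag-of-little-bootstraps resampling structure under simplifying assumptions: a balanced random-intercept model, a toy moment-based estimator of the between-subject variance standing in for a full weighted LMM fit, and a subset size of \(b = N^{0.7}\) (a common choice in the BLB literature). It illustrates the resampling idea, not the published implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy longitudinal data: N subjects, m observations each, random-intercept model.
N, m = 10_000, 20
sigma_b, sigma_e = 1.0, 2.0
intercepts = rng.normal(0, sigma_b, N)
y = intercepts[:, None] + rng.normal(0, sigma_e, (N, m))  # shape (N, m)

def between_subject_variance(data, weights):
    """Toy moment-based estimate of the random-intercept variance from
    weighted subject means; a stand-in for a full weighted LMM fit."""
    subj_means = data.mean(axis=1)
    grand_mean = np.average(subj_means, weights=weights)
    var_means = np.average((subj_means - grand_mean) ** 2, weights=weights)
    resid_var = np.average(data.var(axis=1, ddof=1), weights=weights)
    return max(var_means - resid_var / data.shape[1], 0.0)

# Bag of little bootstraps over subjects: S small subsets of size b << N,
# each resampled B times with multinomial weights that sum to N.
b = int(N ** 0.7)   # subset size
S, B = 10, 100      # number of subsets and bootstrap replicates per subset
ci_per_subset = []
for _ in range(S):
    subset = rng.choice(N, size=b, replace=False)
    data_s = y[subset]
    estimates = []
    for _ in range(B):
        weights = rng.multinomial(N, np.full(b, 1.0 / b))  # counts sum to N
        estimates.append(between_subject_variance(data_s, weights))
    ci_per_subset.append(np.percentile(estimates, [2.5, 97.5]))

# Average the per-subset confidence limits to form the BLB interval.
print(np.mean(ci_per_subset, axis=0))
```

The key point is that each replicate touches only \(b\) subjects, so the expensive refit is paid on a small subset while the multinomial weights preserve the nominal sample size \(N\).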

Related publication

Software

Relative error versus processing time on 1 million subjects and 20 million total observations. Our method finishes all calculations within 170 seconds, which is more than 200 times faster than the bootstrap, and achieves lower relative error.

Inference for constrained and regularized estimation problems

Many statistical learning tasks, including the Lasso, constrained Lasso, graphical Lasso, matrix completion, and sparse low-rank matrix regression, can be posed as regularized and/or constrained maximum likelihood estimation problems that require solving optimization problems of the form

\[\text{maximize} \quad \ell(\boldsymbol{\theta}) - \rho P(\boldsymbol{\theta}),\]

where \(\boldsymbol{\theta}\) denotes the model parameters, \(\ell(\boldsymbol{\theta})\) denotes the log-likelihood and quantifies the lack-of-fit between the model and the data, \(P(\boldsymbol{\theta})\) is a regularization function that imposes structure on parameter estimates, and \(\rho\) is a nonnegative regularization strength parameter that trades off the model fit encoded in \(\ell(\boldsymbol{\theta})\) with the desired structure encoded in \(P(\boldsymbol{\theta})\). For many problems of this form, statistical inference for \(\boldsymbol{\theta}\) either does not exist or requires substantial problem-specific analysis, such as developing new priors or deriving analytic results. For example, extending post-selection inference results for the Lasso to the setting where additional constraints are incorporated (the constrained Lasso) is not trivial.
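As a concrete instance of this template, the constrained Lasso with a sum-to-zero constraint (the example shown in the figure at the end of this section) corresponds to

\[\ell(\boldsymbol{\theta}) = -\tfrac{1}{2} \lVert \mathbf{y} - \mathbf{X}\boldsymbol{\theta} \rVert_2^2, \qquad P(\boldsymbol{\theta}) = \lVert \boldsymbol{\theta} \rVert_1, \qquad \text{subject to } \mathbf{1}^\top \boldsymbol{\theta} = 0,\]

where the linear constraint can be folded into the penalty through the indicator function of the constraint set (zero on the set, \(+\infty\) off it).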

My work in this area aims to make statistical inference for this class of problems much easier and more automatic. It addresses these challenges by integrating ideas from the optimization literature, such as Moreau-Yosida envelopes and proximal maps, with the powerful machinery of Bayesian inference. This approach has led to ProxMCMC, a flexible, general, and fully Bayesian inference framework for constrained and regularized estimation.
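For reference, the Moreau-Yosida envelope and the proximal map of a penalty \(P\) are defined, for a smoothing parameter \(\lambda > 0\), as

\[P_{\lambda}(\boldsymbol{\theta}) = \min_{\mathbf{z}} \left\{ P(\mathbf{z}) + \frac{1}{2\lambda} \lVert \mathbf{z} - \boldsymbol{\theta} \rVert_2^2 \right\}, \qquad \operatorname{prox}_{\lambda P}(\boldsymbol{\theta}) = \arg\min_{\mathbf{z}} \left\{ P(\mathbf{z}) + \frac{1}{2\lambda} \lVert \mathbf{z} - \boldsymbol{\theta} \rVert_2^2 \right\}.\]

The envelope is a differentiable approximation of \(P\) whose gradient is available through the proximal map, which is what allows non-smooth or constrained penalties to be handled with gradient-based MCMC samplers.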

Related publication

Software

ProxMCMC for the constrained Lasso with a sum-to-zero constraint on model parameters. From left to right: 95% credible intervals for the model parameters, where dots mark the truth; histogram of the sum of the model parameters over 10,000 samples; coverage probability for the model coefficients calculated from 1,000 simulated data sets, where the red line marks the nominal level of 95%.