Data Management and Analysis Core (DMAC)

Our Center encompasses research from many different disciplines, all requiring specialized statistical methods. In some cases, we need entirely new methods in order to better understand and model complex relationships. We also need a data repository so that others may easily access the tools and databases that we are developing. The DMAC provides this kind of support across all projects.

Our Goals

The goal of the DMAC is to provide statistical, bioinformatics, and data management support for all aspects of the project.

Our Approach

Our team is known for developing innovative statistical methods that help answer questions about environmental exposures and integrate high-dimensional exposure, molecular, and phenotypic data. We also provide support in geographic information systems (GIS), offer statistics education and training for trainees and researchers connected with the SRC, and engage pre- and post-doctoral trainees from the Biostatistics Department in SRC projects.

 

Recent Publications

Nilanjana Laha and Rajarshi Mukherjee. 2023. “On Support Recovery with Sparse CCA: Information Theoretic and Computational Limits.” IEEE Trans Inf Theory, 69, 3, Pp. 1695-1738.

In this paper, we consider asymptotically exact support recovery in the context of high dimensional and sparse Canonical Correlation Analysis (CCA). Our main results describe four regimes of interest based on information theoretic and computational considerations. In regimes of "low" sparsity we describe a simple, general, and computationally easy method for support recovery, whereas in a regime of "high" sparsity, it turns out that support recovery is information theoretically impossible. For the sake of information theoretic lower bounds, our results also demonstrate a non-trivial requirement on the "minimal" size of the nonzero elements of the canonical vectors that is required for asymptotically consistent support recovery. Subsequently, the regime of "moderate" sparsity is further divided into two subregimes. In the lower of the two sparsity regimes, we show that polynomial time support recovery is possible by using a sharp analysis of a co-ordinate thresholding [1] type method. In contrast, in the higher end of the moderate sparsity regime, appealing to the "Low Degree Polynomial" Conjecture [2], we provide evidence that polynomial time support recovery methods are inconsistent. Finally, we carry out numerical experiments to compare the efficacy of various methods discussed.

Hufeng Zhou, Theodore Arapoglou, Xihao Li, Zilin Li, Xiuwen Zheng, Jill Moore, Abhijith Asok, Sushant Kumar, Elizabeth E Blue, Steven Buyske, Nancy Cox, Adam Felsenfeld, Mark Gerstein, Eimear Kenny, Bingshan Li, Tara Matise, Anthony Philippakis, Heidi L Rehm, Heidi J Sofia, Grace Snyder, Zhiping Weng, Benjamin Neale, Shamil R Sunyaev, and Xihong Lin. 2023. “FAVOR: functional annotation of variants online resource and annotator for variation across the human genome.” Nucleic Acids Res, 51, D1, Pp. D1300-D1311.

Large biobank-scale whole genome sequencing (WGS) studies are rapidly identifying a multitude of coding and non-coding variants. They provide an unprecedented resource for illuminating the genetic basis of human diseases. Variant functional annotations play a critical role in WGS analysis, result interpretation, and prioritization of disease- or trait-associated causal variants. Existing functional annotation databases have limited scope to perform online queries and functionally annotate the genotype data of large biobank-scale WGS studies. We develop the Functional Annotation of Variants Online Resources (FAVOR) to meet these pressing needs. FAVOR provides a comprehensive multi-faceted variant functional annotation online portal that summarizes and visualizes findings of all possible nine billion single nucleotide variants (SNVs) across the genome. It allows for rapid variant-, gene- and region-level queries of variant functional annotations. FAVOR integrates variant functional information from multiple sources to describe the functional characteristics of variants and facilitates prioritizing plausible causal variants influencing human phenotypes. Furthermore, we provide a scalable annotation tool, FAVORannotator, to functionally annotate large-scale WGS studies and efficiently store the genotype and their variant functional annotation data in a single file using the annotated Genomic Data Structure (aGDS) format, making downstream analysis more convenient. FAVOR and FAVORannotator are available at https://favor.genohub.org.

Zhonghua Liu, Jincheng Shen, Richard Barfield, Joel Schwartz, Andrea A Baccarelli, and Xihong Lin. 2022. “Large-Scale Hypothesis Testing for Causal Mediation Effects with Applications in Genome-wide Epigenetic Studies.” J Am Stat Assoc, 117, 537, Pp. 67-81.

In genome-wide epigenetic studies, it is of great scientific interest to assess whether the effect of an exposure on a clinical outcome is mediated through DNA methylations. However, statistical inference for causal mediation effects is challenged by the fact that one needs to test a large number of composite null hypotheses across the whole epigenome. Two popular tests, the Wald-type Sobel's test and the joint significance test using the traditional null distribution, are underpowered and thus can miss important scientific discoveries. In this paper, we show that the null distribution of Sobel's test is not the standard normal distribution and the null distribution of the joint significance test is not uniform under the composite null of no mediation effect, especially in finite samples and under the singular point null case that the exposure has no effect on the mediator and the mediator has no effect on the outcome. Our results explain why these two tests are underpowered, and more importantly motivate us to develop a more powerful Divide-Aggregate Composite-null Test (DACT) for the composite null hypothesis of no mediation effect by leveraging epigenome-wide data. We adopted Efron's empirical null framework for assessing statistical significance of the DACT test. We showed analytically that the proposed DACT method had improved power, and could well control type I error rate. Our extensive simulation studies showed that, in finite samples, the DACT method properly controlled the type I error rate and outperformed Sobel's test and the joint significance test for detecting mediation effects. We applied the DACT method to the US Department of Veterans Affairs Normative Aging Study, an ongoing prospective cohort study which included men who were aged 21 to 80 years at entry. We identified multiple DNA methylation CpG sites that might mediate the effect of smoking on lung function with effect sizes ranging from -0.18 to -0.79 and false discovery rate controlled at level 0.05, including the CpG sites in the genes AHRR and F2RL3. Our sensitivity analysis found small residual correlations (less than 0.01) of the error terms between the outcome and mediator regressions, suggesting that our results are robust to unmeasured confounding factors.
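
For readers unfamiliar with the two classical tests named in the abstract above, the following is a minimal sketch of their standard textbook forms for a single mediator. It is not the DACT method and not the authors' code. The function names and the inputs (a, se_a, b, se_b) are hypothetical; they are assumed to be the estimated exposure-to-mediator and mediator-to-outcome regression coefficients with their standard errors.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used as the "traditional" null


def two_sided_p(z: float) -> float:
    """Two-sided p-value of a z-statistic under the standard normal."""
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(z)))


def sobel_test(a: float, se_a: float, b: float, se_b: float) -> tuple[float, float]:
    """Wald-type Sobel test of the product a*b (first-order delta method).

    Returns the z-statistic and its p-value under the standard normal null.
    """
    se_ab = math.sqrt(a * a * se_b * se_b + b * b * se_a * se_a)
    z = (a * b) / se_ab
    return z, two_sided_p(z)


def joint_significance_test(a: float, se_a: float, b: float, se_b: float) -> float:
    """Joint significance test: both paths must be significant.

    The reported p-value is the larger of the two path p-values.
    """
    p_a = two_sided_p(a / se_a)
    p_b = two_sided_p(b / se_b)
    return max(p_a, p_b)


# Hypothetical coefficient estimates for one CpG site (illustrative only).
z_sobel, p_sobel = sobel_test(a=0.3, se_a=0.1, b=0.2, se_b=0.1)
p_joint = joint_significance_test(a=0.3, se_a=0.1, b=0.2, se_b=0.1)
print(f"Sobel z={z_sobel:.3f}, p={p_sobel:.4f}; joint significance p={p_joint:.4f}")
```

In an epigenome-wide analysis these p-values would be computed once per CpG site. The abstract's point is that referring them to the standard normal (Sobel) or uniform (joint significance) null is conservative under the composite null, which is the motivation given for DACT.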

Xihao Li, Godwin Yung, Hufeng Zhou, Ryan Sun, Zilin Li, Kangcheng Hou, Martin Jinye Zhang, Yaowu Liu, Theodore Arapoglou, Chen Wang, Iuliana Ionita-Laza, and Xihong Lin. 2022. “A multi-dimensional integrative scoring framework for predicting functional variants in the human genome.” Am J Hum Genet, 109, 3, Pp. 446-456.

Attempts to identify and prioritize functional DNA elements in coding and non-coding regions, particularly through use of in silico functional annotation data, continue to increase in popularity. However, specific functional roles can vary widely from one variant to another, making it challenging to summarize different aspects of variant function with a one-dimensional rating. Here we propose multi-dimensional annotation-class integrative estimation (MACIE), an unsupervised multivariate mixed-model framework capable of integrating annotations of diverse origin to assess multi-dimensional functional roles for both coding and non-coding variants. Unlike existing one-dimensional scoring methods, MACIE views variant functionality as a composite attribute encompassing multiple characteristics and estimates the joint posterior functional probabilities of each genomic position. This estimate offers more comprehensive and interpretable information in the presence of multiple aspects of functionality. Applied to a variety of independent coding and non-coding datasets, MACIE demonstrates powerful and robust performance in discriminating between functional and non-functional variants. We also show an application of MACIE to fine-mapping and heritability enrichment analysis by using the lipids GWAS summary statistics data from the European Network for Genetic and Genomic Epidemiology Consortium.

Michael Leung, Sebastian T Rowland, Brent A Coull, Anna M Modest, Michele R Hacker, Joel Schwartz, Marianthi-Anna Kioumourtzoglou, Marc G Weisskopf, and Ander Wilson. 2023. “Bias Amplification and Variance Inflation in Distributed Lag Models Using Low-Spatial-Resolution Data.” Am J Epidemiol, 192, 4, Pp. 644-657.

Distributed lag models (DLMs) are often used to estimate lagged associations and identify critical exposure windows. In a simulation study of prenatal nitrogen dioxide (NO2) exposure and birth weight, we demonstrate that bias amplification and variance inflation can manifest under certain combinations of DLM estimation approaches and time-trend adjustment methods when using low-spatial-resolution exposures with extended lags. Our simulations showed that when using high-spatial-resolution exposure data, any time-trend adjustment method produced low bias and nominal coverage for the distributed lag estimator. When using either low- or no-spatial-resolution exposures, bias due to time trends was amplified for all adjustment methods. Variance inflation was higher in low- or no-spatial-resolution DLMs when using a long-term spline to adjust for seasonality and long-term trends due to concurvity between a distributed lag function and secular function of time. NO2-birth weight analyses in a Massachusetts-based cohort showed that associations were negative for exposures experienced in gestational weeks 15-30 when using high-spatial-resolution DLMs; however, associations were null and positive for DLMs with low- and no-spatial-resolution exposures, respectively, which is likely due to bias amplification. DLM analyses should jointly consider the spatial resolution of exposure data and the parameterizations of the time trend adjustment and lag constraints.
