Statistical methods in biological and biomedical sciences

Research Focus 

Our recent research focuses on developing statistical methods for the biological and biomedical sciences. With colleagues, we have developed multiple innovative and effective statistical approaches and computational toolkits for interrogating gene expression data from microarray experiments and for localizing genetic variants that contribute to complex diseases in diverse populations.

Currently, we are developing powerful statistical procedures for identifying pleiotropic genes that contribute to a given set of genetically correlated diseases. These methods will integrate different types of data, e.g., data from next-generation sequencing, genome-wide genetic association, epigenomics, transcriptomics and proteomics studies.

Studies

Integrating admixture mapping with association testing

In an admixed population, e.g., African Americans, chromosomal segments derive from distinct ancestral populations. As a result, marker-specific genotype and ancestry are associated with each other, and single-nucleotide polymorphisms (SNPs) may be in complicated linkage disequilibrium (LD). SNP association and admixture association have each been well investigated in the literature, but separately. Admixture mapping exploits the long-range LD created by admixture between ancestral populations. Genome-wide association studies, which rely on dense markers, have discovered thousands of individual variants associated with human diseases/traits.

To integrate the two pieces of information, we developed a two-stage method to localize causal SNPs that are ancestry informative markers for two ancestral populations. At the first stage, admixture screening is performed to define candidate genetic regions by ancestry association evidence and ancestry informative extent. At the second stage, single-SNP association tests are performed within the candidate regions, and significance is declared against a permutation-based threshold. The permutation-based threshold accounts for ancestry-genotype dependence and marker-to-marker LD. In our simulations, the new method was more powerful than the direct SNP-based genome-wide association test when the allele frequency of a causal variant differed between the ancestral populations by more than 0.4. The method is intended to identify specific causal markers rather than genomic regions that harbor causal markers.
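
The two-stage procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published implementation: the function names, the n·r² statistic, the screening cutoff of 3.84, and the simulated data are all hypothetical. Stage 1 here screens on ancestry association only (the ancestry informative extent criterion is omitted), and the permutation is applied only within the second stage, which is a simplification.

```python
import numpy as np

def assoc_stat(markers, phenotype):
    """n * r^2 between each marker column and the phenotype (roughly chi-square, 1 df)."""
    x = markers - markers.mean(axis=0)
    y = phenotype - phenotype.mean()
    denom = np.sqrt((x ** 2).sum(axis=0) * (y ** 2).sum())
    # Monomorphic columns get r = 0 instead of a division by zero.
    r = (x * y[:, None]).sum(axis=0) / np.where(denom > 0, denom, np.inf)
    return len(y) * r ** 2

def two_stage_scan(genotypes, local_ancestry, phenotype,
                   screen_cutoff=3.84, n_perm=200, alpha=0.05, seed=0):
    """Hypothetical two-stage scan: ancestry screening, then single-SNP tests."""
    rng = np.random.default_rng(seed)
    # Stage 1: keep markers whose local ancestry is associated with the phenotype.
    candidates = np.flatnonzero(assoc_stat(local_ancestry, phenotype) >= screen_cutoff)
    if candidates.size == 0:
        return candidates, candidates, np.nan
    # Stage 2: single-SNP tests among candidates, judged against the permutation
    # distribution of the maximum statistic (this respects marker-to-marker LD).
    observed = assoc_stat(genotypes[:, candidates], phenotype)
    max_null = np.array([
        assoc_stat(genotypes[:, candidates], rng.permutation(phenotype)).max()
        for _ in range(n_perm)
    ])
    threshold = np.quantile(max_null, 1 - alpha)
    return candidates, candidates[observed > threshold], threshold

# Simulated toy data (assumption): allele frequency 0.9 vs 0.1 in the two
# ancestral populations, so every marker is highly ancestry informative.
rng = np.random.default_rng(1)
n, m, causal = 500, 30, 5
ancestry = rng.binomial(2, 0.5, size=(n, m))   # copies inherited from population A
genotypes = rng.binomial(ancestry, 0.9) + rng.binomial(2 - ancestry, 0.1)
phenotype = 1.5 * genotypes[:, causal] + rng.normal(size=n)

candidates, significant, threshold = two_stage_scan(genotypes, ancestry, phenotype)
print("candidates:", candidates, "significant:", significant)
```

Restricting the formal tests to the screened candidates is what reduces the multiple-testing burden relative to a genome-wide single-SNP scan; the price is that a causal SNP with little ancestry signal never reaches the second stage.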

Genuine associations with diseases/traits may occur for certain variants in the absence of admixture signals. For such variants, the above two-stage method may not be as powerful as a conventional one-stage method. Thus, the proposed two-stage approach should be regarded as a complement to the conventional association approach rather than a replacement. Work on integrative alternatives that formally test all variants is ongoing.

Interrogating local population structure for fine mapping

In human genetics, admixed populations offer a unique opportunity for mapping disease variants whose allele frequencies differ between ancestral populations. Adjustment for population structure is necessary to avoid bias in genetic association studies of susceptibility variants for complex diseases. Population structure may differ from one genomic region to another due to the variability of individual ancestry associated with migration, random genetic drift or natural selection. Current methods for correcting population stratification usually adjust only for each individual's global ancestry.

To localize true causal genes for complex diseases in admixed populations more accurately, we advocated interrogating local population structure for fine mapping, that is, adjusting directly for the confounding effect of local ancestry. Through extensive simulations on genome-wide data sets, we illustrated that adjusting for global ancestry principal components (PCs) alone would lead to false positives when local population structure was an important confounding factor. In contrast, adjusting for local ancestry PCs effectively prevented false positives due to local population structure and thus improved fine mapping for disease gene localization. We applied local and global PC adjustments in the analysis of data sets from three genome-wide association studies, including European Americans, African Americans and Nigerians. Both the European Americans and the African Americans demonstrated greater variability in local ancestry than did the Nigerians. Adjusting for local ancestry PCs successfully eliminated the known spurious association between SNPs in the lactase gene and height that arises from population structure in European Americans.
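
The contrast between global and local PC adjustment can be illustrated with a small simulation. The sketch below is not the analysis pipeline used in the studies above; the function names, the window of 10 markers, the number of PCs, and the simulated effect sizes are all assumptions chosen for illustration. The phenotype depends on local ancestry in one region, and the test SNP in that region has no effect of its own.

```python
import numpy as np

def top_pcs(matrix, k):
    """Scores of the first k principal components of a samples-by-markers matrix."""
    centered = matrix - matrix.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]

def adjusted_t_stat(genotype, phenotype, covariates):
    """t statistic for the genotype effect in a linear model with covariates."""
    n = len(phenotype)
    design = np.column_stack([np.ones(n), genotype, covariates])
    beta, *_ = np.linalg.lstsq(design, phenotype, rcond=None)
    resid = phenotype - design @ beta
    sigma2 = resid @ resid / (n - design.shape[1])
    cov = sigma2 * np.linalg.pinv(design.T @ design)
    return beta[1] / np.sqrt(cov[1, 1])

# Simulated admixed sample (all settings are assumptions).
rng = np.random.default_rng(2)
n = 1000
global_prop = rng.uniform(0.6, 0.9, size=n)                   # global ancestry proportion
genome_ancestry = rng.binomial(2, global_prop[:, None], size=(n, 200))
local = rng.binomial(2, global_prop)                          # local ancestry in the test region
window_ancestry = np.tile(local[:, None], (1, 10))            # 10 markers sharing that ancestry
genotype = rng.binomial(local, 0.8) + rng.binomial(2 - local, 0.2)
phenotype = 1.0 * local + rng.normal(size=n)                  # no direct genotype effect

t_global = adjusted_t_stat(genotype, phenotype, top_pcs(genome_ancestry, 2))
t_local = adjusted_t_stat(genotype, phenotype, top_pcs(window_ancestry, 1))
print(f"global-PC adjusted t = {t_global:.2f}, local-PC adjusted t = {t_local:.2f}")
```

In this setup the global PCs capture each individual's genome-wide ancestry proportion but not the ancestry carried at the test region, so the genotype remains a proxy for local ancestry and its test statistic stays inflated. Adjusting for the local ancestry PC removes that proxy effect.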

Further, we proved theoretically that local ancestry at a test SNP may confound the association signal, and we showed empirically that adjusting for local ancestry at the test SNP is necessary and informative for removing spurious association due to local ancestry. We modeled the genotype distribution of the test SNP given disease status and flanking marker genotypes and used a conditional likelihood framework to develop a novel association test. A key advantage of this method is that it accommodates different directions of association in the ancestral populations.

Data-driven weighting in genetic association analysis

For pedigree-based genome-wide association studies, we proposed a powerful data-driven weighting scheme to enhance the statistical power to identify significant variants. The standard top R approach is preferable to the standard one-stage approach if R, the number of markers retained at the screening stage for formal testing, is carefully selected. The non-informative exponential weighting scheme (EWS) ensures that all genotyped markers are formally tested. The power of the EWS depends on the size of the smallest marker group. Both approaches allow only nuclear families and discard parental information even when it is available. An extended top R approach in the literature incorporates founder information and allows for general pedigrees. In practice, however, the choice of R remains a crucial challenge. Our new scheme combines a novel informative weighting scheme with the strengths of the EWS and the extended top R approach and overcomes their limitations. In terms of power, it outperformed the optimal extended top R and optimal EWS approaches and was robust to reasonable levels of population stratification. It controlled the family-wise error rate at the desired level regardless of LD structure.
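
The general idea behind weighted testing, in which every marker is formally tested but promising markers receive a larger share of the error budget, can be sketched with a weighted Bonferroni rule. This is a generic illustration and is neither the EWS nor the informative weighting scheme described above; the function names and the placeholder weight function are hypothetical. Marker j is tested at level alpha * w_j / sum(w), so the per-marker levels sum to alpha. The family-wise error rate is then controlled at alpha, provided the weights are chosen independently of the test p-values under the null hypothesis.

```python
import numpy as np

def weighted_bonferroni(p_values, weights, alpha=0.05):
    """Reject marker j when p_j <= alpha * w_j / sum(w); thresholds sum to alpha."""
    p = np.asarray(p_values, dtype=float)
    w = np.asarray(weights, dtype=float)
    if p.shape != w.shape:
        raise ValueError("p_values and weights must have the same shape")
    if np.any(w < 0) or w.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    thresholds = alpha * w / w.sum()
    return p <= thresholds, thresholds

def placeholder_weights(screen_stats, scale=1.0):
    """Hypothetical weight function: larger screening statistic, larger weight."""
    s = np.asarray(screen_stats, dtype=float) / scale
    return np.exp(s - s.max())   # subtracting the max avoids overflow

p_values = np.array([0.001, 0.02, 0.04, 0.5])

# Equal weights reproduce the ordinary Bonferroni correction.
reject_uniform, _ = weighted_bonferroni(p_values, np.ones(4))

# Up-weighting the second marker lets it pass a less stringent threshold.
reject_informed, thresholds = weighted_bonferroni(p_values, np.array([1.0, 5.0, 1.0, 1.0]))
print(reject_uniform, reject_informed, thresholds)
```

Unlike a top R rule, no marker is dropped before formal testing; a marker with a small weight simply faces a stricter threshold.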

Population and family designs have each been well addressed, but separately, with notable efforts on two-stage schemes. The core idea of the vast majority of existing two-stage schemes is to formally test only the R most promising markers. We extended our informative weighting to make efficient use of available natural population and family resources together and to ensure that all SNPs are formally tested. Theoretically, the new approach can rigorously control the family-wise error rate at a preset nominal level. Empirically, the new scheme proved more powerful than prevailing one-stage and two-stage approaches, e.g., the standard population-based score test, the standard pedigree disequilibrium test, the family-based optimal exponential weighting scheme and the optimal top R approach.

Microarray data clustering for gene discovery

In microarray studies, a central challenge is to reliably identify as many biologically significant genes as possible while controlling false positives. We proposed an efficient method to identify differentially expressed genes in microarray experiments. This method incorporates a novel stratification-based tight clustering algorithm, principal component analysis and information pooling. It proved substantially more powerful than the popular SAM and eBayes approaches. The proposed method applies to both low- and high-replication microarray experiments. It has been applied to three real microarray datasets: one from a Populus nitrogen deficiency experiment with 3 biological replicates, and two public datasets on colon cancer and leukemia with 10 to 40 biological replicates. In all three analyses, our method proved more robust than the popular alternatives for identifying differentially expressed genes.

Existing correlation-based methods in this area were developed from intuitive ideas, empirical experiments, and the classical theory of sample correlation. A pertinent theoretical foundation and analytical strategies for gene-to-gene correlations are needed for various scenarios. To fill this gap, we obtained a generic stochastic representation of gene-gene correlation and its asymptotic distributions, with convergence rates, with respect to variation in differential magnitudes, residual correlations, and experiment sizes. This effort deepens insight into gene-gene correlation and serves as the theoretical basis for our forward-search-based tight clustering.
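
A simple special case shows why differential magnitudes and residual correlations both matter. The sketch below is an elementary illustration, not the stochastic representation referred to above, and all names in it are hypothetical. It assumes two genes with unit residual variance, residual correlation rho, and mean shifts delta1 and delta2 between two groups, with a fraction frac of samples in the shifted group. Pooling the two groups gives a population correlation of (rho + c * delta1 * delta2) / sqrt((1 + c * delta1^2) * (1 + c * delta2^2)), where c = frac * (1 - frac).

```python
import numpy as np

def pooled_correlation(delta1, delta2, rho, frac=0.5):
    """Population correlation of two genes when both groups are pooled."""
    c = frac * (1 - frac)   # between-group variance factor
    num = rho + c * delta1 * delta2
    den = np.sqrt((1 + c * delta1 ** 2) * (1 + c * delta2 ** 2))
    return num / den

def simulate_pooled_correlation(delta1, delta2, rho, n_per_group, seed=0):
    """Sample correlation from a balanced two-group simulation (frac = 0.5)."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    control = rng.multivariate_normal([0.0, 0.0], cov, size=n_per_group)
    treated = rng.multivariate_normal([delta1, delta2], cov, size=n_per_group)
    pooled = np.vstack([control, treated])
    return np.corrcoef(pooled, rowvar=False)[0, 1]

theory = pooled_correlation(2.0, 2.0, 0.2)
small = simulate_pooled_correlation(2.0, 2.0, 0.2, n_per_group=5)
large = simulate_pooled_correlation(2.0, 2.0, 0.2, n_per_group=10000)
print(f"theory = {theory:.3f}, n=5 per group: {small:.3f}, n=10000 per group: {large:.3f}")
```

Two differentially expressed genes can therefore appear strongly correlated even when their residual correlation is weak, and with few replicates the sample correlation can sit far from its population value, which is why experiment size enters the asymptotic analysis.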