The Methodology Center | Advancing methods, improving health

High-Dimensional Data Analysis


High-dimensional data, including genetic data, are becoming increasingly available as data collection technology evolves. Behavioral scientists need powerful, effective analytic methods to glean maximum scientific insight from these data. 

    Over the last few years, Runze Li and other statisticians have been developing new methods for analyzing high-dimensional data. Now, Center researchers are extending these methods for use in behavioral research focused on, for example, preventing drug abuse and HIV-risk behavior. Future statistical work will develop methods to analyze genetic data simultaneously with intensive longitudinal data. This work will allow scientists to identify which genetic, individual, and social factors predict drug abuse, HIV-risk behavior, and related health behaviors.

      

     

    High-Dimensional Variable Screening

    In genetic studies, the number of variables is extremely large relative to the number of participants: there may be hundreds of subjects and hundreds of thousands of variables. This cripples exploratory data analysis because nearly all multivariate procedures break down when the number of variables exceeds the sample size. As a result, it is necessary to reduce the full set of variables to a subset of predictors that potentially affect the outcome of interest. High-dimensional variable-screening procedures allow researchers to narrow the variables down to such a subset before the main analysis.
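
To make the idea of screening concrete, here is a minimal sketch in Python that ranks predictors by their absolute marginal correlation with the outcome and keeps the top few. This is one simple, generic screening rule chosen for illustration; it is not necessarily the procedure developed by Center researchers. The function name `marginal_correlation_screen`, the simulated data, and the choice to keep 50 variables are all assumptions made for this example.

```python
import numpy as np

def marginal_correlation_screen(X, y, keep):
    """Keep the `keep` columns of X with the largest |correlation| with y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)               # center each predictor
    yc = y - y.mean()                     # center the outcome
    x_norm = np.sqrt((Xc ** 2).sum(axis=0))
    x_norm[x_norm == 0] = np.inf          # constant columns get correlation 0
    corr = (Xc.T @ yc) / (x_norm * np.sqrt((yc ** 2).sum()))
    order = np.argsort(-np.abs(corr))     # strongest marginal signal first
    return np.sort(order[:keep]), corr

# Simulated example (hypothetical data): 200 subjects, 5000 variables,
# only variables 10 and 200 truly affect the outcome.
rng = np.random.default_rng(0)
n, p = 200, 5000
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 10] - 2.0 * X[:, 200] + rng.standard_normal(n)

selected, corr = marginal_correlation_screen(X, y, keep=50)
print(len(selected), "variables retained out of", p)
```

After screening, the retained subset is small enough relative to the sample size that a conventional or penalized regression can be fit to it.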

      

    Read about our statistical work in variable screening

     

      

    High-Dimensional Variable Selection

    Other types of genetic studies focus on specific genes. This creates a situation in which the sample size is somewhat larger than the number of predictors (e.g., 500 subjects and 300 variables). In these situations, many variables are often highly correlated. A model that retains all of them may include many nonsignificant variables, have less predictive power, and be difficult to interpret, so a more parsimonious model is desirable. Approaches such as penalized least squares and penalized likelihood with the smoothly clipped absolute deviation (SCAD) penalty can select the significant variables. We are developing broadly applicable techniques for high-dimensional variable selection. We also developed PROC SCAD, a pair of SAS procedures using the SCAD penalty for high-dimensional variable selection.
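
For readers unfamiliar with the SCAD penalty, the sketch below evaluates it in Python using its standard piecewise definition: linear for small coefficients, quadratic in a transition zone, and constant for large coefficients, so that large effects are not shrunk further. The function name `scad_penalty` is hypothetical, and the default a = 3.7 is the value commonly suggested in the SCAD literature. This is only the penalty function, not a reimplementation of PROC SCAD or of any fitting algorithm.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty evaluated elementwise at theta (requires lam > 0, a > 2)."""
    t = np.abs(np.asarray(theta, dtype=float))
    linear = lam * t                                              # |theta| <= lam
    quadratic = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)) # lam < |theta| <= a*lam
    constant = lam**2 * (a + 1) / 2                               # |theta| > a*lam
    return np.where(t <= lam, linear,
                    np.where(t <= a * lam, quadratic, constant))

# The penalty grows like the lasso near zero, then levels off.
print(scad_penalty([0.0, 0.5, 2.0, 10.0], lam=1.0))
```

In penalized least squares, this penalty is summed over the regression coefficients and added to the residual sum of squares; coefficients driven exactly to zero correspond to variables dropped from the model.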

     

    Read about our statistical work in variable selection, organized by data type

    Researchers

    Lead researcher: Runze Li

     

    Other researchers: 

    Anne Buu

    John Dziak


     

    Funding

    Our research on variable selection has been supported by National Science Foundation grants DMS 0102505, DMS 0322673, DMS 0348869, CCF 0430349, and DMS 0722351; National Institute on Drug Abuse grant P50 DA10075; and National Institutes of Health Roadmap grant R21 DA024260.
