Variable Selection
As data collection technology and storage devices become more powerful, scientists are able to collect many variables and large numbers of observations in their studies. In the absence of prior knowledge, data analysts may include a very large number of variables in their models at the initial stage of modeling in order to reduce possible model bias (approximation error). In general, however, a complicated model that includes many insignificant variables may have less predictive power, and its results may be difficult to interpret. In these cases a more parsimonious model is desirable in practice, which makes variable selection fundamental to statistical modeling.

Many approaches in use, such as stepwise selection or elimination procedures and best subset selection, can be computationally expensive or ignore stochastic errors in the variable selection process. In recent work, new approaches such as the SCAD (smoothly clipped absolute deviation) penalty have been proposed to select significant variables for various statistical models. Based on penalized likelihood, these approaches delete insignificant covariates by estimating their coefficients as zero, and shrink or adjust the other estimates accordingly. They therefore select significant variables and estimate parameters simultaneously. Fan and Li (2001) show that SCAD-penalized regression has an “oracle property”: asymptotically, it performs as well in terms of variance as if the correct submodel of significant variables were known in advance.
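To make concrete how a SCAD penalty can set small coefficients to zero while leaving large ones essentially unshrunk, the following is a minimal Python sketch of the penalty function and the corresponding thresholding rule as given by Fan and Li (2001). It assumes the simplest setting, an orthonormal design, in which the penalized least squares estimate can be obtained by applying the rule to each ordinary least squares coefficient separately. The function names `scad_penalty` and `scad_threshold` are hypothetical and chosen for illustration only; the default `a = 3.7` is the value suggested by Fan and Li. This is not the PROC SCADLS implementation.

```python
import math


def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(|theta|); assumes lam > 0 and a > 2."""
    t = abs(theta)
    if t <= lam:
        # Near zero the penalty is linear, like an L1 (lasso) penalty.
        return lam * t
    if t <= a * lam:
        # Quadratic transition region between the linear and flat parts.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Beyond a*lam the penalty is constant, so large coefficients
    # incur no additional shrinkage.
    return lam * lam * (a + 1) / 2


def scad_threshold(z, lam, a=3.7):
    """SCAD thresholding rule applied to a single least squares estimate z.

    Assumption: orthonormal design, so each coefficient is handled separately.
    """
    t = abs(z)
    sign = math.copysign(1.0, z)
    if t <= 2 * lam:
        # Soft thresholding: estimates with |z| <= lam become exactly zero.
        return sign * max(t - lam, 0.0)
    if t <= a * lam:
        # Intermediate region: shrinkage is gradually relaxed.
        return ((a - 1) * z - sign * a * lam) / (a - 2)
    # Large estimates are left unchanged.
    return z


if __name__ == "__main__":
    # Illustrative inputs only, not data from any study.
    for z in (0.5, 1.5, 3.0, 5.0):
        print(z, scad_threshold(z, lam=1.0))
```

With `lam = 1.0`, an estimate of 0.5 is set to zero (the variable is deleted), 1.5 is shrunk, and 5.0 is returned unchanged. The last case is what distinguishes SCAD from a pure soft-thresholding (lasso-type) rule, which shrinks every nonzero estimate by the same amount.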
Software Downloads: Center researchers Runze Li and John Dziak have also developed a SAS plug-in procedure, PROC SCADLS, which allows a user to easily try out the SCAD procedure in the linear regression case.