McAnulty College and Graduate School of Liberal Arts
Dimension Reduction; Partial Least Squares; Penalized Regression; Predictive Modeling; Regression; Selective Inference
Several problems arise when attempting to use traditional predictive modeling techniques on ‘big data.’ For instance, multiple linear regression models often cannot be used reliably on datasets with hundreds of variables. However, several techniques are becoming common tools for selective inference as the need to analyze big data increases. Forward selection and penalized regression models (such as LASSO, Ridge Regression, and Elastic Net) are simple modifications of multiple linear regression that can provide some guidance on simplifying a model through variable selection. Dimension reduction techniques, such as Partial Least Squares and Principal Components Analysis, are more complex than regression but can handle highly correlated independent variables. Each of these techniques is valuable in predictive modeling if used properly. This paper provides a mathematical introduction to these developments in selective inference. A sample dataset is used to demonstrate modeling and interpretation. Further, the applications to big data, as well as the advantages and disadvantages of each procedure, are discussed.
Papke, S. (2017). A Review of 'Big Data' Variable Selection Procedures For Use in Predictive Modeling (Master's thesis, Duquesne University). Retrieved from https://dsc.duq.edu/etd/182