Sometimes a large number of independent variables, Xi, is available for a given modeling problem, and not all of these predictor variables contribute equally well to the explanation of the predicted variable Y; some may not contribute at all. We therefore have to select among these variables to obtain a model which contains as few variables as possible while still being the "best" model. In principle, all possible combinations of independent variables should be tried when calculating a suitable model. Since p candidate variables give rise to 2^p - 1 non-empty subsets, this turns out to be a formidable task even if high-performance computers are available.
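To make the size of this search concrete, the following sketch (a hypothetical Python illustration, not part of the original text) enumerates every non-empty subset of candidate variables, fits an ordinary least squares model to each, and keeps the subset with the lowest BIC. Using BIC rather than raw r² as the comparison criterion is our assumption, made because r² alone always favours the largest subset, as discussed in the list below.

    # A minimal sketch of exhaustive (best-subset) selection for a linear
    # regression model; all names, data, and the use of BIC are illustrative
    # assumptions, not taken from the text. With p candidate variables there
    # are 2**p - 1 non-empty subsets to fit.
    from itertools import combinations
    import numpy as np

    def best_subset(X, y):
        """Fit every column subset of X and return the one with lowest BIC."""
        n, p = X.shape
        best_bic, best_cols = np.inf, None
        for k in range(1, p + 1):
            for cols in combinations(range(p), k):
                A = np.column_stack([np.ones(n), X[:, cols]])  # intercept + subset
                coef, *_ = np.linalg.lstsq(A, y, rcond=None)
                rss = np.sum((y - A @ coef) ** 2)              # residual sum of squares
                bic = n * np.log(rss / n) + (k + 1) * np.log(n)
                if bic < best_bic:
                    best_bic, best_cols = bic, cols
        return best_cols

    rng = np.random.default_rng(0)
    X = rng.normal(size=(30, 6))             # 6 candidates -> 63 subsets to try
    y = 2 * X[:, 0] - X[:, 3] + rng.normal(scale=0.1, size=30)
    print(best_subset(X, y))                 # typically recovers (0, 3)

Note that each additional candidate variable doubles the number of subsets to be fitted, which is why this exhaustive approach breaks down quickly as p grows.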
Apart from the practical feasibility of this approach, there are also several theoretical considerations which should be taken into account:
- the contribution of a single variable to the explanation of Y cannot easily be assessed if only a small number of observations is available
- a simple criterion, such as the goodness of fit, r², may lead to wrong conclusions if the number of selected variables approaches the number of observations (a small demonstration follows this list)
- for more complicated models (e.g. artificial neural networks) the calculation of a single model may be so time-consuming that it is practically impossible to find the "best" combination of independent variables
- the selection of combinations is guided by the available data; thus the resulting final selection reflects the "best" model for the given data set, and not necessarily the "best" subset for the population
- some of the selection methods are specifically tailored to linear (regression) models and cannot be used with non-linear methods such as neural networks
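The second point above can be made concrete with a small synthetic demonstration (the data and code are illustrative assumptions, not from the text): even when both the response and the predictors are pure random noise, r² climbs towards 1 as the number of fitted variables approaches the number of observations.

    # Demonstration (with synthetic data) of why r2 alone is a misleading
    # selection criterion: fitting pure noise to pure noise, r2 still
    # approaches 1 as the number of predictors approaches the number of
    # observations.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 20
    y = rng.normal(size=n)                   # response is pure noise
    X = rng.normal(size=(n, n - 1))          # predictors are pure noise, too

    for k in (1, 5, 10, 15, 19):
        A = np.column_stack([np.ones(n), X[:, :k]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
        print(f"k = {k:2d} predictors: r2 = {r2:.3f}")

With k = 19 predictors plus an intercept, the fit to the n = 20 observations is exact and r² reaches 1, even though no predictor carries any information about y.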
Depending on the type of model being used, there are several strategies
to (partially) solve the problem: