Multiple Outliers in Linear Regression: Advances in Detection Methods, Robust Estimation, and Variable Selection
Abstract:
Empirical evidence suggests that unusual or outlying observations in data sets are much more prevalent than one might expect, averaging roughly 5 to 10 percent in many industries. This research addresses multiple outliers in the linear regression model. Although reliable for a single outlier or a few outliers, standard diagnostic techniques from an ordinary least squares (OLS) fit can fail to identify multiple outliers. The parameter estimates, diagnostic quantities, and model inferences from a contaminated data set can differ significantly from those obtained with the clean data, so the researcher requires a dependable method to identify and accommodate these multiple outliers. This research tests both direct methods (detection algorithms) and indirect methods (robust regression estimators) for identifying multiple outliers. A comprehensive Monte Carlo simulation study, using a designed experiment approach, evaluates the impact that outlier density and geometry, regressor variable dimension, and outlying distance have on numerous published methods, focusing on outlier configurations likely to be encountered in practice. The results for each scenario reveal the strengths and limitations of each technique, and recommendations are given accordingly.

OLS is the optimal regression estimator under a set of assumptions on the distribution of the error term and the predictor variables. Compound robust regression estimators have been proposed as alternatives when some OLS assumptions fail: they can accommodate multiple outliers while limiting the influence of observations with remote levels of the predictor variables. This research proposes a new compound estimator that is more effective than currently published methods for extreme observations in X space and for high-dimensional problems. This research also addresses the variable selection problem for compound robust regression estimators.
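As a hedged illustration of the masking phenomenon the abstract refers to (not a method from this research), the following NumPy sketch shows how a cluster of high-leverage outliers can drag an OLS fit toward itself, so that the outliers' own residuals from the contaminated fit appear unremarkable even though their residuals from the true line are enormous. All data values and the contamination pattern here are invented for illustration.

```python
# Minimal sketch of masking by multiple high-leverage outliers in
# simple linear regression. All numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Clean data generated from the true line y = 2x + noise.
n = 50
x = rng.uniform(0, 10, n)
y = 2 * x + rng.normal(0, 0.5, n)

# Five clustered, high-leverage outliers far off the true line.
x_out = np.full(5, 20.0)
y_out = np.full(5, 5.0)
xc = np.concatenate([x, x_out])
yc = np.concatenate([y, y_out])

def ols(xv, yv):
    """Least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    X = np.column_stack([np.ones_like(xv), xv])
    beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
    return beta

b_clean = ols(x, y)      # slope near the true value of 2
b_contam = ols(xc, yc)   # slope dragged far from 2 by the cluster

# Because the contaminated fit passes near the outlier cluster, the
# outliers' residuals from that fit are modest (they mask one another),
# while their residuals from the true line are huge.
resid_contam = y_out - (b_contam[0] + b_contam[1] * x_out)
resid_true = y_out - 2 * x_out

print("clean-fit slope:       ", b_clean[1])
print("contaminated-fit slope:", b_contam[1])
print("max |outlier residual|, contaminated fit:", np.abs(resid_contam).max())
print("max |outlier residual|, true line:       ", np.abs(resid_true).max())
```

This is exactly the failure mode that single-outlier OLS diagnostics inherit: deletion statistics computed one observation at a time never see the fit that would result if the whole cluster were removed at once.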