Robust techniques for linear regression with multicollinearity and outliers


Bibliographic Details
Main Author: Mohammed, Mohammed Abdulhussein
Format: Thesis
Language: English
Published: 2016
Subjects: Regression analysis; Multicollinearity
Online Access: http://psasir.upm.edu.my/id/eprint/58669/1/IPM%202016%201IR%20D.pdf
Description
The ordinary least squares (OLS) method is the most widely used estimation method for the multiple linear regression model because of its optimal properties and ease of computation. Unfortunately, in the presence of multicollinearity and outlying observations, the OLS estimates are inefficient, with inflated standard errors. Outlying observations can be classified into several types, such as vertical outliers, high leverage points (HLPs) and influential observations (IOs). Identifying HLPs and IOs is crucial because they exert a large effect on various estimators and cause masking and swamping of outliers in multiple linear regression. The commonly used diagnostic measures fail to identify these observations correctly. Hence, a new improvised diagnostic robust generalized potential (IDRGP) is proposed. The proposed IDRGP is very successful in detecting multiple HLPs, with smaller masking and swamping rates. This thesis is also concerned with diagnostic measures for identifying bad influential observations (BIOs). Detecting BIOs is important because they are responsible for inaccurate predictions and invalid inferential statements, having a large impact on the computed values of various estimates. The generalized version of DFFITS (GDFF) was developed to identify IOs without considering whether they are good or bad influential observations. Moreover, although GDFF can detect multiple IOs, it tends to detect fewer IOs than it should because of swamping and masking effects. A new method, the modified generalized DFFITS (MGDFF), is therefore developed, whereby the suspected HLPs in the initial subset are identified using the proposed IDRGP diagnostic method. To the best of our knowledge, no research has been done on the classification of observations into regular, good and bad IOs. Thus, the IDRGP-MGDFF plot is formulated to close this gap in the literature.
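The classical diagnostic that GDFF and MGDFF generalize is DFFITS, built from hat-matrix leverages and externally studentized residuals. As a rough illustration of the classical quantities involved (not the author's robust IDRGP or MGDFF procedures), DFFITS can be computed as:

```python
import numpy as np

def dffits(X, y):
    """Classical DFFITS for each observation of an OLS fit.

    X must include an intercept column. Returns (dffits, leverages).
    """
    n, p = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
    h = np.diag(H)                              # leverages h_i
    e = y - H @ y                               # OLS residuals
    s2 = e @ e / (n - p)                        # residual variance
    # leave-one-out variance s_(i)^2
    s2_i = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_i * (1 - h))             # externally studentized residuals
    return t * np.sqrt(h / (1 - h)), h
```

Observations with |DFFITS_i| > 2√(p/n) are conventionally flagged as influential; the thesis's point is that such classical rules break down through masking and swamping when several HLPs are present.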
This thesis also addresses the problem of multicollinearity in multiple linear regression models with regard to two sources. The first source is HLPs; the second is the data collection method employed, constraints on the model or in the population, model specification, or an over-defined model. However, no research has focused on a parameter estimation method that remedies multicollinearity caused by multiple HLPs. Hence, we propose a new estimation method, the modified GM-estimator (MGM), based on MGDFF. The results of the study indicate that the MGM estimator is the most efficient method for rectifying multicollinearity caused by HLPs. When multicollinearity is due to other sources (not HLPs), several classical methods are available; among them, ridge regression (RR), jackknife ridge regression (JRR) and latent root regression (LRR) have been put forward. Nevertheless, it is now evident that these classical estimation methods perform poorly when outliers exist in the data. In this regard, we propose two types of robust estimation methods. The first type is an improved version of LRR that rectifies the simultaneous problems of multicollinearity and outliers; it is formulated by incorporating the robust MM-estimator and the modified generalized M-estimator (MGM) into the LRR algorithm. We call these methods the latent root MM-based (LRMMB) and latent root MGM-based (LRMGMB) methods. Similarly, the second type aims to improve the performance of robust jackknife ridge regression: the MM-estimator and the MGM-estimator are integrated into the JRR algorithm to establish improved versions of JRR, called the jackknife ridge MM-based (JRMMB) and jackknife ridge MGM-based (JRMGMB) methods.
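For reference, the classical ridge estimator that RR (and its jackknife and robust variants) starts from is β̂(k) = (X'X + kI)⁻¹X'y on standardized predictors. A minimal sketch of this classical version only, with the ridge constant k supplied by the user rather than chosen by any of the rules studied in the thesis:

```python
import numpy as np

def ridge(X, y, k):
    """Classical ridge regression on standardized predictors.

    Returns (intercept, slopes) on the original scale; k >= 0 is the
    ridge (shrinkage) constant, with k = 0 reducing to OLS.
    """
    # centre y and standardize the columns of X (usual ridge convention)
    xm, xs = X.mean(axis=0), X.std(axis=0)
    Z = (X - xm) / xs
    ym = y.mean()
    p = Z.shape[1]
    # ridge normal equations: (Z'Z + kI) b = Z'(y - ym)
    b = np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ (y - ym))
    beta = b / xs                     # back to the original scale
    intercept = ym - xm @ beta
    return intercept, beta
```

Increasing k shrinks the coefficients, stabilizing them when X'X is near-singular; the robust versions discussed above replace these non-robust ingredients with MM- or MGM-based counterparts.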
All the proposed methods outperform the commonly used methods when multicollinearity occurs together with multiple HLPs. The classical multicollinearity diagnostic measure is not suitable for correctly diagnosing multicollinearity in the presence of multiple HLPs: when the classical variance inflation factor (VIF) is employed, HLPs can inflate or deflate the apparent multicollinearity pattern, giving misleading conclusions and an incorrect indication of how to solve the multicollinearity problem. In this respect, we propose a robust VIF, denoted RVIF(JACK-MGM), which serves as a good indicator to help statistics practitioners choose an appropriate estimator for solving the multicollinearity problem.
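The classical VIF that RVIF(JACK-MGM) robustifies is VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on the others; equivalently, VIF_j is the j-th diagonal element of the inverse correlation matrix of the predictors. A minimal sketch of the classical (non-robust) version only:

```python
import numpy as np

def vif(X):
    """Classical variance inflation factors, VIF_j = 1 / (1 - R_j^2).

    X holds the predictors (no intercept column).
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
    R = (Z.T @ Z) / len(X)                     # correlation matrix
    return np.diag(np.linalg.inv(R))           # diagonal of R^-1 = VIFs
```

VIF_j ≥ 1 always, and values above about 10 are commonly read as serious multicollinearity. Because the correlation matrix here is the ordinary, non-robust one, a few HLPs can inflate or deflate these values — precisely the distortion the proposed robust VIF is designed to avoid.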
Mohammed, Mohammed Abdulhussein (2016) Robust techniques for linear regression with multicollinearity and outliers. Doctoral thesis, Universiti Putra Malaysia.