A Multiple Linear Regression… “Wait what? Did I read that correctly?”
Regressions are mathematical processes in which relationships between variables are observed. There are several types of regressions that can be conducted on data, namely: linear, polynomial, logistic, exponential, power, and so on. By definition, a linear regression is a technique for finding the mathematical relationship between a dependent variable and an independent variable. It finds the line of best fit for the data through a purely mathematical procedure, so the result depends only on the data itself, not on human judgment. Linear regressions are essential because they allow future predictions to be made based on the data. However, the real world is complex, and it is often hard to explain a dependent variable using a single independent variable, because several factors may be involved in, or required for, a solution. In class, we went over how a correlation does not always imply cause and effect. By using the concept of multiple regression, those common cause factors can be represented mathematically within the regression itself. The least squares method is perhaps the most widely used way of predicting the value of a dependent variable from a single independent variable; the same method, however, can be extended to predict the value of one dependent variable from two or more independent variables. When that is the situation, the process is termed a multiple linear regression. One of the more common examples of a multiple linear regression pertains to job satisfaction, wherein several variables such as salary, years of employment, age, gender, and family status might influence one’s job satisfaction.
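To make the simple case concrete first, here is a minimal sketch in Python of how a least-squares line of best fit is computed and then used for prediction; the x and y values are made-up numbers, purely for illustration:

```python
# A minimal sketch of a simple least-squares fit (hypothetical data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # dependent variable

# np.polyfit with degree 1 returns the slope and intercept of the
# least-squares line of best fit.
slope, intercept = np.polyfit(x, y, 1)
print(f"y = {slope:.2f}x + {intercept:.2f}")

# The fitted line can then be used to predict future values:
print("Predicted y at x = 6:", slope * 6 + intercept)
```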
The steps in forming a multiple regression are almost the same as those for forming a simple linear regression. To begin with, you must form the research hypothesis, followed by the null hypothesis. Next, you should assess each variable independently, obtaining the measures of central tendency and the measures of spread, and asking whether the variable is normally distributed. Then, assess the relationship that each independent variable has with the dependent variable, and determine the correlation coefficient: are the two variables related? Next, assess the relationships that the independent variables have with each other. It is vital to determine whether the independent variables are too highly correlated with one another, and an effective way to do this is to form a correlation coefficient matrix, which describes the effect of each variable on the others, taking into consideration significance and the Pearson coefficient. After that, the regression equation must be formed. This is often done with the aid of technology, because coming up with the equation by hand is a strenuous and difficult process. Finally, the correlation coefficient is analyzed and appropriate decisions are made regarding the hypothesis. However, despite knowing all the steps involved in conducting a multiple regression, you must still be wondering what the equation actually looks like.
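Before getting to the equation itself, here is a rough sketch in Python of those preliminary steps, using a small hypothetical job-satisfaction data set (the numbers and column names below are invented for illustration):

```python
# A minimal sketch of the preliminary checks, on hypothetical data.
import pandas as pd

df = pd.DataFrame({
    "satisfaction": [6, 7, 5, 8, 9, 4, 7, 8],        # dependent variable
    "salary":       [50, 60, 45, 70, 80, 40, 65, 75],
    "years":        [2, 5, 1, 8, 10, 1, 6, 9],
    "age":          [25, 30, 24, 38, 45, 23, 33, 41],
})

# Measures of central tendency and spread for each variable.
print(df.describe())

# Pearson correlation coefficient matrix: the first row/column shows each
# predictor's relationship to the dependent variable, while the entries
# among the independent variables flag potential multicollinearity.
print(df.corr())
```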
The general form of a multiple regression equation is Y = B0 + B1X1 + B2X2 + … + BnXn, where Y is the dependent variable, each Xi is an independent variable, each Bi is the coefficient describing that variable’s effect, and B0 is the constant. So, for instance, suppose you were trying to determine someone’s height after they have completed puberty. In this case, the person’s height would be Y, because its value depends on other variables and factors. B1 could represent the effect a mother’s height has on her child’s height, while X1 would be a particular mother’s height when predicting the height of her child. Likewise, B2 and X2 would represent the father’s height. The third independent variable, X3, could represent the child’s gender as a binary variable: for instance, you would let females be represented by a zero and males be represented by a one. Lastly, the constant B0 could perhaps be the average height of people before puberty. You would then compile all the pieces of data in a statistical program to form the regression model, because developing the regression by hand is extremely rigorous and time-consuming, which is why technology is preferred (often, the SPSS software is used). A multiple regression composed of two independent variables and one dependent variable can be visualized as a three-dimensional scatter plot, because there are three variables involved in the correlation, with the regression forming a plane of best fit rather than a line. With four variables, the data would live in four dimensions and could no longer be visualized directly.
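Returning to the height example, here is a minimal sketch that fits the coefficients by ordinary least squares in Python instead of SPSS; all of the heights below (in cm) are hypothetical:

```python
# A minimal sketch of the height example, fit with ordinary least squares.
import numpy as np

mother = np.array([160, 165, 158, 170, 172, 163])   # X1 (hypothetical)
father = np.array([175, 170, 182, 178, 190, 174])   # X2 (hypothetical)
gender = np.array([0, 1, 0, 1, 1, 0])               # X3: 0 = female, 1 = male
child  = np.array([162, 175, 165, 181, 186, 164])   # Y, the dependent variable

# Design matrix with a leading column of ones for the constant B0.
X = np.column_stack([np.ones_like(mother, dtype=float), mother, father, gender])

# Solve the least-squares problem for the coefficients B0..B3.
coef, *_ = np.linalg.lstsq(X, child, rcond=None)
b0, b1, b2, b3 = coef
print(f"height = {b0:.1f} + {b1:.2f}*mother + {b2:.2f}*father + {b3:.1f}*gender")
```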
However, despite the fact that a multiple regression is effective in determining relationships and explaining the changes in data that occur because of certain independent variables, there are several issues attached to the process. For one, if the data cannot be modeled by a linear relationship, then a multiple regression is pointless, as it will not be able to provide any valid inference. Moreover, performing a multiple regression increases the amount of data that must be collected and managed. As a rule of thumb, a single linear regression with a strong correlation would require at least fifty pieces of data; with the added independent variables, it would be reasonable to attain at least twenty more pieces of data per variable added. In addition, variables that do not significantly contribute to the regression should be eliminated, as their effects on it are minimal. Furthermore, the issue of multicollinearity can arise when variables are added to the correlation. This occurs when two or more of the independent variables are highly correlated with each other. If a correlation coefficient of about 0.75 or higher is observed between two independent variables, there is a strong chance that an issue with multicollinearity will arise. If two variables are highly correlated, they are basically measuring the same phenomenon, and when one enters the regression equation, it explains most of the relevant variance in the dependent variable. This leaves little for the second independent variable to do, and can distort the estimated coefficients (a sketch of such a check follows below). On the other hand, despite the potential for problems with multiple regressions, in a world where several common cause factors can affect correlations, the option to include additional independent variables provides researchers and analysts with strong advantages when testing their hypotheses.
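As a rough illustration, here is a minimal sketch in Python of a multicollinearity check, flagging any pair of independent variables whose Pearson coefficient reaches the 0.75 threshold mentioned above; the data are hypothetical, and the parents’ heights are deliberately made nearly collinear so that the flag fires:

```python
# A minimal sketch of a pairwise multicollinearity check (hypothetical data).
import numpy as np

predictors = {
    "mother": np.array([160, 165, 158, 170, 172, 163], dtype=float),
    "father": np.array([175, 180, 172, 185, 188, 178], dtype=float),  # ~mother + 15
    "gender": np.array([0, 1, 0, 1, 1, 0], dtype=float),
}

names = list(predictors)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        # Pearson coefficient between the two independent variables.
        r = np.corrcoef(predictors[names[i]], predictors[names[j]])[0, 1]
        flag = "  <-- possible multicollinearity" if abs(r) >= 0.75 else ""
        print(f"r({names[i]}, {names[j]}) = {r:+.2f}{flag}")
```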