Monday, June 27, 2016

Multiple Regression. 

Reflections on a complex tool. 


Anybody with a basic math background knows that the equation of a line is y = a + bx. We also know that the idea behind Regression is to find the line that best "fits" our data on a particular plane. Let us take, for example, y = 3 + 2x and add a random number from a normal distribution with a mean of 0 and a standard deviation of s = 5 to create some dispersion. There are many considerations for a sample size to be "enough" for Regression, but let's take 40 as a rule of thumb (never fewer than 15). Those points on a graph could look like this:
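For readers who want to reproduce the setup, here is a minimal sketch in plain Python. The x range (40 evenly spaced values from 0 to 39), the seed, and the helper name fit_line are assumptions made for illustration; the post only fixes the line y = 3 + 2x, the noise (mean 0, s = 5) and the sample size of 40.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable (arbitrary choice)

# Assumption: 40 evenly spaced x values from 0 to 39 (the x range is not given in the post)
xs = [float(i) for i in range(40)]
# y = 3 + 2x plus normal noise with mean 0 and standard deviation 5
ys = [3 + 2 * x + random.gauss(0, 5) for x in xs]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx          # slope
    a = my - b * mx        # intercept: the line passes through (mean x, mean y)
    return a, b

a, b = fit_line(xs, ys)
print(f"fitted line: y = {a:.2f} + {b:.2f}x")
```

The fitted intercept and slope should land near 3 and 2, but not exactly on them, because of the added noise.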



After fitting the regression line and adding a 95% Confidence Interval (green dotted lines in the image below), the graph now shows a band that, with 95% confidence, contains the true line.



It can also be observed that approximately 95% of the variation in "y" is explained by the variation in "x" (R-sq = 95%). Also, S = 5 (which we already knew) means that the line equation +/- 2S should cover the area where the value of "y" truly is about 95% of the time (a property of the normal distribution).
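The two numbers above can be computed directly from the residuals. The sketch below is only an illustration under assumed conditions (x from 0 to 39, an arbitrary seed); with that x range R-sq tends to come out close to 95%, but the exact value depends on the sample.

```python
import math
import random

random.seed(2)  # arbitrary seed for repeatability

# Assumption: same setup as the post, with an assumed x range of 0..39
xs = [float(i) for i in range(40)]
ys = [3 + 2 * x + random.gauss(0, 5) for x in xs]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
a = my - b * mx

residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
sse = sum(r * r for r in residuals)        # variation left unexplained by the line
sst = sum((y - my) ** 2 for y in ys)       # total variation in y

r_sq = 1 - sse / sst                       # share of variation explained by x
s = math.sqrt(sse / (n - 2))               # residual standard error ("S")

# Fraction of observed points that fall inside the band: fitted line +/- 2S
inside = sum(1 for r in residuals if abs(r) <= 2 * s) / n

print(f"R-sq = {r_sq:.1%}, S = {s:.2f}, within +/- 2S: {inside:.0%}")
```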

But all of this describes the data that is already there. If we want to guess where the next value of "y" is going to fall over that same range of "x", we use the Prediction Interval (95%; orange dotted lines in the image below).


This should be no surprise to anybody: your ability to predict where the next point is going to land depends on the overall variation of the system, so the Prediction Interval is wider than the Confidence Interval. It is like trying to throw an arrow with a shaking hand: most likely it will not land in the exact same place!
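To make the difference between the two intervals concrete, here is a rough sketch that computes both half-widths at a few values of "x" using the standard simple-regression formulas. The x range, the seed, the helper name half_widths, and the hard-coded t value (2.024, taken from a t table for 38 degrees of freedom) are assumptions of this sketch, not something taken from the graphs.

```python
import math
import random

random.seed(3)  # arbitrary seed for repeatability

# Assumption: x from 0 to 39, y = 3 + 2x + noise with standard deviation 5
xs = [float(i) for i in range(40)]
ys = [3 + 2 * x + random.gauss(0, 5) for x in xs]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
a = my - b * mx
s = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))

T_CRIT = 2.024  # two-sided 95% t value for n - 2 = 38 degrees of freedom (table value)

def half_widths(x0):
    """Return (confidence, prediction) interval half-widths at x0."""
    leverage = 1 / n + (x0 - mx) ** 2 / sxx
    ci = T_CRIT * s * math.sqrt(leverage)       # uncertainty about the line itself
    pi = T_CRIT * s * math.sqrt(1 + leverage)   # adds the scatter of a single new point
    return ci, pi

for x0 in (0.0, 19.5, 39.0):
    ci, pi = half_widths(x0)
    print(f"x = {x0:4.1f}: fit = {a + b * x0:6.2f}, CI +/- {ci:.2f}, PI +/- {pi:.2f}")
```

The extra "1 +" under the square root is the shaking hand: even if the line were known perfectly, a single new point would still scatter around it with standard deviation S.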

Now let's add another axis using a different variable in an attempt to explain "y" better. This adds depth to our graph, so we will be working in a three-dimensional space.



Because of software and time limitations, let's imagine: if you fit the line for "y" and "X1" and then extend it across "X2", you get a plane (like ironing the wrinkles out of the graph above). If the regression line stays at exactly the same height all the way across, "X2" has no effect on "y"; if it moves up or down but keeps the same slope, "X2" has an effect but no interaction with "X1". You can also add the Confidence and Prediction Intervals above and below that plane. From here a lot of reflections can be made (some mind-blowing): what is happening if the "X1" regression line twists across "X2" (like a helix)? What if it bends while traveling through "X2"?

This only matters when the regression equation contains an interaction term, for example y = a + bX1 + cX2 + d(X1*X2). If there is no significant interaction and the equation is y = a + bX1 + cX2, then it could be seen as if the original regression with "X1" just gets tilted by the presence of "X2"... a much simpler case.
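A small numerical sketch may help with the "twist". Everything in it is hypothetical: the grid of X1 and X2 values, the "true" coefficients, the seed, and the helper names solve and ols. It writes the interaction model as y = a + bX1 + cX2 + d(X1*X2), one common way of including an X1*X2 term, and fits it by least squares through the normal equations.

```python
import random

def solve(A, v):
    """Solve A x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def ols(rows, ys):
    """Least squares coefficients from the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

random.seed(4)  # arbitrary seed for repeatability

# Hypothetical data: a 10 x 10 grid of (X1, X2) and a made-up true model with interaction
points = [(float(i), float(j)) for i in range(10) for j in range(10)]
ys = [3 + 2 * x1 + 1.5 * x2 + 0.5 * x1 * x2 + random.gauss(0, 5) for x1, x2 in points]

# Additive model: y = a + b*X1 + c*X2 (a flat, tilted plane)
additive = ols([[1.0, x1, x2] for x1, x2 in points], ys)

# Interaction model: y = a + b*X1 + c*X2 + d*(X1*X2)
a, b, c, d = ols([[1.0, x1, x2, x1 * x2] for x1, x2 in points], ys)

print("additive fit:", [round(v, 2) for v in additive])
for x2 in (0.0, 9.0):
    # With an interaction, the slope of y against X1 depends on where you stand on X2
    print(f"slope of X1 at X2 = {x2:.0f}: {b + d * x2:.2f}")
```

In the interaction fit the slope of the "X1" line is b + d*X2, so it changes as you travel across "X2"; in the additive fit that slope is the same everywhere and the surface is simply a tilted plane.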

Anyway, multiple regression means that at least two variables are having a significant effect on the output being measured. The situation gets more complex as you add more "Xs" and they interact in different ways (like in real life, doesn't it?) and even more so if different types of variables are added (e.g. ordinal attributes).

Regardless of how hard it gets, it is always beneficial to try to visualize it. The better the intuition, the less likely we are to get lost in the numbers.
