**Multiple Linear Regression is a linear regression model with more than one explanatory variable.** In our last blog, we discussed Simple Linear Regression and the R-Squared concept. The Adjusted R-Squared of that simple linear regression model was 0.409. As a rule of thumb, however, we would like a model to have an Adjusted R-Squared of 0.8 or more. To improve model performance, we will use more than one explanatory variable, i.e., build a Multiple Linear Regression model.

## Multiple Linear Regression

### Import Data

We will use the data file **Inc_Exp_Data.csv** to learn multiple linear regression in Python/R. The file can be downloaded from the Resources section.

```python
# Python code to import the file
import pandas as pd

inc_exp = pd.read_csv("Inc_Exp_Data.csv")
```

```r
### R code to import the file ###
inc_exp <- read.csv("Inc_Exp_Data.csv")
```

### View Data

```python
inc_exp.head(16)  # Python syntax to view data
```

```r
View(inc_exp)  # R syntax to view data
```

### Metadata

| Sr. No. | Column Name | Description |
| --- | --- | --- |
| 1 | Mthly_HH_Income | Monthly Household Income |
| 2 | Mthly_HH_Expense | Monthly Household Expense (Dependent Variable) |
| 3 | No_of_Fly_Members | Number of Family Members |
| 4 | Emi_or_Rent_Amt | Monthly EMI or Rent Amount |
| 5 | Annual_HH_Income | Annual Household Income |
| 6 | Highest_Qualified_Member | Education Level of the Highest Qualified Member in the Household |
| 7 | No_of_Earning_Members | Number of Earning Members |

### Hypothesis

A hypothesis is a statement of what you expect to find in the data. When framing a hypothesis, make sure it relates an independent variable to the dependent variable. Hypotheses play a very important role in machine learning, so a Data Scientist should give considerable time to hypothesis development and hypothesis testing.

Here are my hypotheses for the first three independent variables:

| Independent Variable | Hypothesis |
| --- | --- |
| Mthly_HH_Income | Households with a higher monthly income are likely to have a relatively higher monthly expense (positive correlation). |
| No_of_Fly_Members | Families with more members are likely to have a higher monthly expense (positive correlation). |
| Emi_or_Rent_Amt | EMI or rent adds to the monthly cash outflow, so households paying it may have higher overall monthly expenses than households that own their residence. |

The hypotheses can be validated using graphical methods such as scatter plots, and numerical methods such as correlation analysis and regression. We will test our hypotheses using:

- Scatter plots (pair plots)
- P-value of the independent variables in the Linear Regression model
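The correlation analysis mentioned above can be sketched quickly with pandas. The rows below are hypothetical toy values, not records from Inc_Exp_Data.csv. They only illustrate the call, so replace `toy` with the real `inc_exp` DataFrame to check the actual hypotheses.

```python
import pandas as pd

# Hypothetical toy rows for illustration only (NOT taken from Inc_Exp_Data.csv).
toy = pd.DataFrame({
    "Mthly_HH_Income":   [10000, 20000, 30000, 40000, 50000, 60000],
    "Mthly_HH_Expense":  [8000, 12000, 20000, 22000, 30000, 33000],
    "No_of_Fly_Members": [2, 3, 4, 4, 5, 6],
    "Emi_or_Rent_Amt":   [0, 0, 5000, 3000, 8000, 10000],
})

# Pearson correlation of each candidate predictor with the dependent variable.
corr_with_expense = toy.corr()["Mthly_HH_Expense"].drop("Mthly_HH_Expense")
print(corr_with_expense)
```

A positive value for a predictor is consistent with the corresponding "positive correlation" hypothesis. It does not by itself establish statistical significance, which is what the p-values in the regression summary are for.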

### Pair Plots

It is good practice to perform univariate and bivariate analyses of the data before building models. **Pair Plots** are a really simple (one-line-of-code simple!) way to visualize the relationship between each pair of variables.
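As a rough sketch of how such a plot can be produced: seaborn's `pairplot(inc_exp)` is the usual one-liner, but the example below uses pandas' `scatter_matrix` instead so that it only needs pandas and matplotlib. The DataFrame `demo` is synthetic stand-in data generated for illustration (its coefficients and noise level are made up), so the picture will not match the real file.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in data (assumption): NOT the contents of Inc_Exp_Data.csv.
rng = np.random.default_rng(0)
n = 50
income = rng.uniform(10000, 100000, n)
members = rng.integers(1, 8, n)
emi = rng.choice([0, 0, 0, 3000, 8000, 15000], n)  # many zeros, like the real column
expense = 0.3 * income + 2000 * members + 0.5 * emi + rng.normal(0, 4000, n)

demo = pd.DataFrame({
    "Mthly_HH_Expense": expense,
    "Mthly_HH_Income": income,
    "No_of_Fly_Members": members,
    "Emi_or_Rent_Amt": emi,
})

# One call draws a scatter plot for every pair of columns,
# with a histogram of each variable on the diagonal.
axes = scatter_matrix(demo, figsize=(9, 9), diagonal="hist")
```

In an interactive session, call `matplotlib.pyplot.show()` afterwards to display the grid. With the real data, pass `inc_exp` (or a subset of its numeric columns) instead of `demo`.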

#### Inferences from Pair Plots:

1. The trends between the dependent and independent variables are as per our hypothesis.

2. The distributions of the Mthly_HH_Expense, Mthly_HH_Income, and No_of_Fly_Members variables are approximately normal.

3. The Emi_or_Rent_Amt variable is highly skewed. Many observations have a value of 0.

### Build the Model
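A minimal sketch of how a model like this could be fitted in Python is shown below. It assumes the `statsmodels` formula API, which is one common choice and not necessarily the tool used to produce the summary discussed in this post. The three explanatory variables are the ones for which hypotheses were framed, and `demo` is synthetic stand-in data, so the numbers it prints will differ from those interpreted below.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption): replace `demo` with the real `inc_exp`.
rng = np.random.default_rng(42)
n = 50
demo = pd.DataFrame({
    "Mthly_HH_Income": rng.uniform(10000, 100000, n),
    "No_of_Fly_Members": rng.integers(1, 8, n),
    "Emi_or_Rent_Amt": rng.choice([0, 3000, 8000, 15000], n),
})
demo["Mthly_HH_Expense"] = (
    0.3 * demo["Mthly_HH_Income"]
    + 2000 * demo["No_of_Fly_Members"]
    + 0.5 * demo["Emi_or_Rent_Amt"]
    + rng.normal(0, 5000, n)  # made-up noise level
)

# Dependent variable on the left of "~", explanatory variables on the right.
model = smf.ols(
    "Mthly_HH_Expense ~ Mthly_HH_Income + No_of_Fly_Members + Emi_or_Rent_Amt",
    data=demo,
).fit()

print(model.summary())                      # full regression summary table
print(model.rsquared, model.rsquared_adj)   # R-squared and Adjusted R-squared
print(model.pvalues)                        # p-value of each coefficient
print(model.f_pvalue)                       # p-value of the overall F test
```

The last three lines pull out the individual statistics that the interpretation refers to: Adjusted R-squared, the coefficient p-values, and the p-value of the F test.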

#### Interpretation of Regression Summary:

1. Adjusted R-squared of the model is 0.6781. This statistic has to be read as **“67.81% of the variance in the dependent variable is explained by the model”**.

2. All the explanatory variables are statistically significant. (p-values < alpha; assume alpha = 0.0001).

3. The signs (+ or -) of the beta coefficients are in line with the correlation trends observed between the dependent and the independent variables.

4. The p-value of the F Test statistic is 5.7e-12. We conclude that our linear regression model fits the data better than the model with no independent variables.

## Adjusted R-squared

Adjusted R-Squared, as the term suggests, is R-Squared with an adjustment factor. It is a modified version of R-Squared that has been adjusted for the number of predictor variables in the model.

#### Why use Adjusted R-Squared and not R-Squared?

Assume you add a random variable that has no real relationship with the dependent variable, only a chance one. Adding such a variable never lowers the model's R-squared statistic and will usually raise it slightly. The Adjusted R-Squared statistic, however, will decrease and penalize the model if the explanatory variable does not contribute to it. This is evident from the Adjusted R-Squared formula.

$$
\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}
$$

Where **n** is the number of records and **k** is the number of explanatory variables.

The denominator **(n – k – 1)** penalizes the R² for every additional variable. If the added variable does not improve the model R² enough to offset this penalty, the Adjusted R² value will decrease. A drop in Adjusted R² suggests the added term should be dropped from the model.

### Let us understand Adjusted R-Squared with a practical example

Let us add a Sr_No column to the data and use it as an explanatory variable.
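The experiment can be sketched as follows. To keep the arithmetic visible, this version uses plain NumPy least squares rather than a regression library. The data are synthetic stand-ins and `fit_r2` is a hypothetical helper written for this illustration, so the printed values will not match the ones reported in this post.

```python
import numpy as np

def fit_r2(X, y):
    """Fit OLS with an intercept; return (R-squared, Adjusted R-squared)."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)  # least-squares coefficients
    residuals = y - design @ beta
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return r2, adj_r2

# Synthetic stand-in data (assumption): NOT the contents of Inc_Exp_Data.csv.
rng = np.random.default_rng(7)
n = 50
X = np.column_stack([
    rng.uniform(10000, 100000, n),           # stand-in for Mthly_HH_Income
    rng.integers(1, 8, n),                   # stand-in for No_of_Fly_Members
    rng.choice([0, 3000, 8000, 15000], n),   # stand-in for Emi_or_Rent_Amt
])
y = 0.3 * X[:, 0] + 2000 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 5000, n)

# Sr_No is just the row number 1..n, unrelated to how y was generated.
sr_no = np.arange(1, n + 1)
X_with_sr = np.column_stack([X, sr_no])

r2_base, adj_base = fit_r2(X, y)
r2_sr, adj_sr = fit_r2(X_with_sr, y)
print(f"without Sr_No: R2={r2_base:.4f}, Adjusted R2={adj_base:.4f}")
print(f"with Sr_No:    R2={r2_sr:.4f}, Adjusted R2={adj_sr:.4f}")
```

R² with the extra column can never be lower than without it. Whether Adjusted R² drops depends on whether the small gain in R² outweighs the penalty for the additional variable.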

#### Interpretation

By adding the Sr_No term, R-Squared increased from 0.6978 to 0.7022. However, Adjusted R-Squared decreased from 0.6781 to 0.6757. As Adjusted R-Squared has decreased, the added term (Sr_No) should be dropped from the model.
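The reported figures can be checked by hand with the standard Adjusted R-Squared formula, 1 − (1 − R²)(n − 1)/(n − k − 1). The number of records is not stated in this post; n = 50 is an assumption, chosen because it reproduces both reported values, with k = 3 before and k = 4 after adding Sr_No.

$$
\text{Adjusted } R^2_{k=3} = 1 - \frac{(1 - 0.6978)(50 - 1)}{50 - 3 - 1} = 1 - \frac{0.3022 \times 49}{46} \approx 0.6781
$$

$$
\text{Adjusted } R^2_{k=4} = 1 - \frac{(1 - 0.7022)(50 - 1)}{50 - 4 - 1} = 1 - \frac{0.2978 \times 49}{45} \approx 0.6757
$$

The gain in R² of 0.0044 is too small to offset the loss of one degree of freedom in the denominator, so the adjusted value falls.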

### Next Blog

In our upcoming blog, we will explain the concept of Multicollinearity, Prediction using the model, and more.
