# Linear Regression with Python | Machine Learning | KGP Talkie

## What is Linear Regression?

You are a `real estate` agent and you want to `predict` house prices. It would be great to have some kind of `automated system` that predicts the price of a house from various inputs, which are known as features.

Supervised `machine learning` algorithms need data to train a `model` before it can make predictions. For that we have the Boston dataset.

### Where can Linear Regression be used?

It is a very `powerful technique` and can be used to understand the factors that influence `profitability`. It can be used to `forecast sales` in the coming months by analyzing the sales data for previous months. It can also be used to gain various insights about `customer behaviour`.

### What is Regression?

Let's first understand what `Regression` means. It is a statistical method used in finance, investing, and other disciplines that attempts to determine the `strength` and `character` of the relationship between one `dependent variable` (usually denoted by `Y`) and a series of other variables known as `independent variables`.

`Linear Regression` is a statistical technique in which a dependent variable is `predicted` from a set of `independent variable(s)`.

### Regression Examples

#### Stock prediction

We can `predict` the price of a `stock` from independent variables `x`, for example the recent history of the stock price and news events.

#### Tweet popularity

We can also `estimate` the number of people who will `retweet` your tweet on Twitter, based on the number of followers and the popularity of the hashtag.

#### In real estate

As we discussed earlier, we can also `predict` house prices and land prices in real estate.

### Regression Types

It is of two types: `simple linear regression` and `multiple linear regression`.

**Simple linear regression:** It is characterized by a single `independent variable`.

#### Simple Linear Regression

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

- $y$ = dependent variable
- $\beta_0$ = population intercept
- $\beta_1$ = population coefficient (slope)
- $x$ = independent variable
- $\varepsilon_i$ = random error
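To make the roles of the intercept and the coefficient concrete, here is a minimal sketch that estimates both from data using the closed-form least-squares formulas. The toy data (generated from y = 2 + 3x plus noise) and the variable names `beta_0` and `beta_1` are assumptions chosen for illustration; they are not part of the Boston example.

```python
import numpy as np

# Toy data generated from y = 2 + 3x plus small random noise (illustrative values only).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=x.size)

# Closed-form least-squares estimates for simple linear regression:
# slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x).
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

print(f"intercept = {beta_0:.3f}, slope = {beta_1:.3f}")
```

The estimates should land close to the values used to generate the data, and they agree with what `np.polyfit(x, y, 1)` returns.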

#### Multiple Linear Regression

It (as the name suggests) is characterized by `multiple independent variables` (more than `1`). When looking for the best `fit line`, you are not limited to a straight line: you can also fit a `polynomial` curve to the data, which is called `polynomial regression`.

### Assessing the performance of the model

#### How do we determine the best fit line?

The line for which the `error` between the `predicted values` and the `observed values` is minimum is called the `best fit line` or `the regression line`. These errors are also called `residuals`. The residuals can be visualized as the vertical lines from each observed data value to the `regression line`.
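As a small illustration of the idea, the sketch below computes the residuals for two candidate lines on a handful of made-up points and compares their sums of squared residuals. The data values and the helper name `sum_squared_residuals` are hypothetical and only serve the example.

```python
import numpy as np

# Five made-up observations (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sum_squared_residuals(intercept, slope):
    """Sum of squared vertical distances from the points to the line."""
    residuals = y - (intercept + slope * x)
    return float(np.sum(residuals ** 2))

# Least-squares line found by NumPy (polyfit returns the slope first for deg=1).
slope, intercept = np.polyfit(x, y, deg=1)
best = sum_squared_residuals(intercept, slope)

# An arbitrary guess for comparison.
guess = sum_squared_residuals(0.0, 2.5)

print("best fit line:", best)
print("guessed line: ", guess)
```

No other straight line gives a smaller sum of squared residuals than the least-squares line, which is why the guess scores worse.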

### Bias-Variance tradeoff

`Bias` refers to the simplifying `assumptions` made by a model to make the target function easier to learn. Variance is the amount by which the estimate of the target function would change if different training data were used. The goal of any supervised `machine learning` algorithm is to achieve `low bias and low variance`, and in turn `good` prediction performance.

### How to determine error

### Gradient descent algorithm

`Gradient descent` is the `backbone` of many `machine learning` algorithms. To estimate the predicted value for `Y`, we start with a `random value` for *θ*, then compute the cost using the `Mean Squared Error (MSE)`. Remember that we are looking for the `minimum value` of the `cost function`, which we find using the `derivative` of the `cost function`.
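The procedure can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the synthetic data, the `learning_rate` value, and the number of iterations are all chosen for illustration, and the cost is the MSE, whose gradient with respect to *θ* is (2/n) Xᵀ(Xθ − y).

```python
import numpy as np

# Synthetic data from y = 4 + 3x plus noise (illustrative values only).
rng = np.random.default_rng(42)
x = rng.uniform(0, 1, size=200)
y = 4.0 + 3.0 * x + rng.normal(scale=0.1, size=x.size)

# Design matrix with a column of ones so that theta[0] acts as the intercept.
X = np.column_stack([np.ones_like(x), x])
n = len(y)

def mse(theta):
    """Mean squared error cost for parameters theta."""
    return float(np.mean((X @ theta - y) ** 2))

theta = rng.normal(size=2)   # start from a random value for theta
learning_rate = 0.5          # assumed step size, small enough to converge here
initial_cost = mse(theta)

for _ in range(2000):
    gradient = (2.0 / n) * X.T @ (X @ theta - y)  # derivative of the MSE
    theta = theta - learning_rate * gradient      # step against the gradient

final_cost = mse(theta)
print("theta:", theta)
print("cost went from", initial_cost, "to", final_cost)
```

Because the MSE of a linear model is convex, gradient descent ends up at the same parameters as the closed-form least-squares solution in this case.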

#### Gradient Descent Algorithm to reduce the cost function

### You might not end up in global minimum

### Implementation with sklearn

### scikit-learn

- Machine Learning in Python
- Simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable – BSD license

Learn more here: https://scikit-learn.org/stable/

Image Source: https://cdn-images-1.medium.com/max/2400/1*2NR51X0FDjLB13u4WdYc4g.png

Let's discuss training an `ML model`. The model generally tries to `predict` one variable based on all the others. To verify how well this `model` works, we need a second data set, the `test set`. We use the model we learned from the `training data` and see how well it predicts the variable in question for the `test set`. When given a `data set` on which you want to use `Machine Learning`, you would typically divide it randomly into 2 sets: one is used for `training`, the other for `testing`.
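Here is a minimal sketch of such a random split with scikit-learn's `train_test_split`. The tiny arrays are made up for illustration; `test_size=0.2` holds out 20% of the rows and `random_state` fixes the shuffle so the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, and one target value per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Shuffle the rows and hold out 20% of them for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)
```

Every sample ends up in exactly one of the two sets, so the model is never scored on rows it was trained on.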

### Training and testing splitting

### Let's get started

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Note: load_boston was removed in scikit-learn 1.2, so this needs an older release
    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, mean_squared_error

    boston = load_boston()
    type(boston)  # sklearn.utils.Bunch
    boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

print(boston.DESCR)

    .. _boston_dataset:

    Boston house prices dataset
    ---------------------------

    **Data Set Characteristics:**

        :Number of Instances: 506

        :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

        :Attribute Information (in order):
            - CRIM     per capita crime rate by town
            - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
            - INDUS    proportion of non-retail business acres per town
            - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
            - NOX      nitric oxides concentration (parts per 10 million)
            - RM       average number of rooms per dwelling
            - AGE      proportion of owner-occupied units built prior to 1940
            - DIS      weighted distances to five Boston employment centres
            - RAD      index of accessibility to radial highways
            - TAX      full-value property-tax rate per $10,000
            - PTRATIO  pupil-teacher ratio by town
            - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
            - LSTAT    % lower status of the population
            - MEDV     Median value of owner-occupied homes in $1000's

        :Missing Attribute Values: None

        :Creator: Harrison, D. and Rubinfeld, D.L.

    This is a copy of UCI ML housing dataset.
    https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

    This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

    The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
    prices and the demand for clean air', J. Environ. Economics & Management,
    vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
    ...', Wiley, 1980. N.B. Various transformations are used in the table on
    pages 244-261 of the latter.

    The Boston house-price data has been used in many machine learning papers that address regression
    problems.

boston.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

boston.target[: 5]

array([24. , 21.6, 34.7, 33.4, 36.2])

    data = boston.data
    type(data)  # numpy.ndarray
    data.shape

(506, 13)

### DataFrame()

A `Data frame` is a `two-dimensional` data structure, i.e., data is aligned in a `tabular fashion` in rows and columns.

    data = pd.DataFrame(data=data, columns=boston.feature_names)
    data.head()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |

    data['Price'] = boston.target
    data.head()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |

# Understand your data

    data.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |

### data.info()

The pandas `DataFrame.info()` function gives a concise summary of the dataframe. It comes in really handy during exploratory analysis, when we want a quick overview of the dataset.

    data.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 506 entries, 0 to 505
    Data columns (total 14 columns):
     #   Column   Non-Null Count  Dtype
    ---  ------   --------------  -----
     0   CRIM     506 non-null    float64
     1   ZN       506 non-null    float64
     2   INDUS    506 non-null    float64
     3   CHAS     506 non-null    float64
     4   NOX      506 non-null    float64
     5   RM       506 non-null    float64
     6   AGE      506 non-null    float64
     7   DIS      506 non-null    float64
     8   RAD      506 non-null    float64
     9   TAX      506 non-null    float64
     10  PTRATIO  506 non-null    float64
     11  B        506 non-null    float64
     12  LSTAT    506 non-null    float64
     13  Price    506 non-null    float64
    dtypes: float64(14)
    memory usage: 55.5 KB

    data.isnull().sum()

    CRIM       0
    ZN         0
    INDUS      0
    CHAS       0
    NOX        0
    RM         0
    AGE        0
    DIS        0
    RAD        0
    TAX        0
    PTRATIO    0
    B          0
    LSTAT      0
    Price      0
    dtype: int64

# Data Visualization

We will start by creating a `scatterplot matrix` that will allow us to visualize the `pair-wise relationships` and `correlations` between the different features.

It is also quite useful for getting a quick overview of how the data is distributed and whether or not it contains outliers.

    sns.pairplot(data)
    plt.show()

    # Plot the distribution of each of the 14 columns on a 2 x 7 grid
    rows = 2
    cols = 7
    fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(16, 4))
    col = data.columns
    index = 0
    for i in range(rows):
        for j in range(cols):
            sns.distplot(data[col[index]], ax=ax[i][j])
            index = index + 1
    plt.tight_layout()
    plt.show()

We will now create a `correlation matrix` to quantify and summarize the relationships between the variables.

This `correlation matrix` is closely related to the `covariance matrix`; in fact, it is a rescaled version of the `covariance matrix`, computed from standardized features.

It is a square matrix (with the same number of columns and rows) that contains Pearson's r `correlation coefficient` for each pair of features.
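The relationship between the two matrices can be checked numerically. The sketch below uses two made-up features (not the Boston data) and shows that the covariance matrix of the standardized features matches the correlation matrix of the original ones.

```python
import numpy as np

# Two made-up, correlated features (illustrative values only).
rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = 2.0 * a + rng.normal(size=100)
features = np.column_stack([a, b])

# Standardize each column: subtract the mean, divide by the sample standard deviation.
standardized = (features - features.mean(axis=0)) / features.std(axis=0, ddof=1)

# Covariance of the standardized features versus Pearson correlation of the originals.
cov_of_standardized = np.cov(standardized, rowvar=False)
corr = np.corrcoef(features, rowvar=False)

print(cov_of_standardized)
print(corr)
```

Both printed matrices are the same up to floating-point rounding, with ones on the diagonal.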

    corrmat = data.corr()
    corrmat

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRIM | 1.000000 | -0.200469 | 0.406583 | -0.055892 | 0.420972 | -0.219247 | 0.352734 | -0.379670 | 0.625505 | 0.582764 | 0.289946 | -0.385064 | 0.455621 | -0.388305 |
| ZN | -0.200469 | 1.000000 | -0.533828 | -0.042697 | -0.516604 | 0.311991 | -0.569537 | 0.664408 | -0.311948 | -0.314563 | -0.391679 | 0.175520 | -0.412995 | 0.360445 |
| INDUS | 0.406583 | -0.533828 | 1.000000 | 0.062938 | 0.763651 | -0.391676 | 0.644779 | -0.708027 | 0.595129 | 0.720760 | 0.383248 | -0.356977 | 0.603800 | -0.483725 |
| CHAS | -0.055892 | -0.042697 | 0.062938 | 1.000000 | 0.091203 | 0.091251 | 0.086518 | -0.099176 | -0.007368 | -0.035587 | -0.121515 | 0.048788 | -0.053929 | 0.175260 |
| NOX | 0.420972 | -0.516604 | 0.763651 | 0.091203 | 1.000000 | -0.302188 | 0.731470 | -0.769230 | 0.611441 | 0.668023 | 0.188933 | -0.380051 | 0.590879 | -0.427321 |
| RM | -0.219247 | 0.311991 | -0.391676 | 0.091251 | -0.302188 | 1.000000 | -0.240265 | 0.205246 | -0.209847 | -0.292048 | -0.355501 | 0.128069 | -0.613808 | 0.695360 |
| AGE | 0.352734 | -0.569537 | 0.644779 | 0.086518 | 0.731470 | -0.240265 | 1.000000 | -0.747881 | 0.456022 | 0.506456 | 0.261515 | -0.273534 | 0.602339 | -0.376955 |
| DIS | -0.379670 | 0.664408 | -0.708027 | -0.099176 | -0.769230 | 0.205246 | -0.747881 | 1.000000 | -0.494588 | -0.534432 | -0.232471 | 0.291512 | -0.496996 | 0.249929 |
| RAD | 0.625505 | -0.311948 | 0.595129 | -0.007368 | 0.611441 | -0.209847 | 0.456022 | -0.494588 | 1.000000 | 0.910228 | 0.464741 | -0.444413 | 0.488676 | -0.381626 |
| TAX | 0.582764 | -0.314563 | 0.720760 | -0.035587 | 0.668023 | -0.292048 | 0.506456 | -0.534432 | 0.910228 | 1.000000 | 0.460853 | -0.441808 | 0.543993 | -0.468536 |
| PTRATIO | 0.289946 | -0.391679 | 0.383248 | -0.121515 | 0.188933 | -0.355501 | 0.261515 | -0.232471 | 0.464741 | 0.460853 | 1.000000 | -0.177383 | 0.374044 | -0.507787 |
| B | -0.385064 | 0.175520 | -0.356977 | 0.048788 | -0.380051 | 0.128069 | -0.273534 | 0.291512 | -0.444413 | -0.441808 | -0.177383 | 1.000000 | -0.366087 | 0.333461 |
| LSTAT | 0.455621 | -0.412995 | 0.603800 | -0.053929 | 0.590879 | -0.613808 | 0.602339 | -0.496996 | 0.488676 | 0.543993 | 0.374044 | -0.366087 | 1.000000 | -0.737663 |
| Price | -0.388305 | 0.360445 | -0.483725 | 0.175260 | -0.427321 | 0.695360 | -0.376955 | 0.249929 | -0.381626 | -0.468536 | -0.507787 | 0.333461 | -0.737663 | 1.000000 |

### Heatmap

A `heatmap` is a `two-dimensional` graphical representation of data where the individual values contained in a `matrix` are represented as `colors`. The `seaborn` Python package allows the creation of `annotated heatmaps`, which can be tweaked using `Matplotlib` tools as required.

Now look at the following script:

    fig, ax = plt.subplots(figsize=(18, 10))
    sns.heatmap(corrmat, annot=True, annot_kws={'size': 12})
    plt.show()

corrmat.index.values

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'Price'], dtype=object)

    def getCorrelatedFeature(corrdata, threshold):
        # Keep the features whose absolute correlation exceeds the threshold
        feature = []
        value = []
        for i, index in enumerate(corrdata.index):
            if abs(corrdata[index]) > threshold:
                feature.append(index)
                value.append(corrdata[index])
        df = pd.DataFrame(data=value, index=feature, columns=['Corr Value'])
        return df

    threshold = 0.50
    corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
    corr_value

| | Corr Value |
|---|---|
| RM | 0.695360 |
| PTRATIO | -0.507787 |
| LSTAT | -0.737663 |
| Price | 1.000000 |

corr_value.index.values

array(['RM', 'PTRATIO', 'LSTAT', 'Price'], dtype=object)

    correlated_data = data[corr_value.index]
    correlated_data.head()

| | RM | PTRATIO | LSTAT | Price |
|---|---|---|---|---|
| 0 | 6.575 | 15.3 | 4.98 | 24.0 |
| 1 | 6.421 | 17.8 | 9.14 | 21.6 |
| 2 | 7.185 | 17.8 | 4.03 | 34.7 |
| 3 | 6.998 | 18.7 | 2.94 | 33.4 |
| 4 | 7.147 | 18.7 | 5.33 | 36.2 |

# Pairplot and Corrmat of correlated data

A pairplot plots the `pairwise` relationships in a dataset. Let's look at the pair plot of the correlated data.

    sns.pairplot(correlated_data)
    plt.tight_layout()

    sns.heatmap(correlated_data.corr(), annot=True, annot_kws={'size': 12}, linewidth=0)
    plt.show()

# Shuffle and Split Data

We will take the `Boston housing dataset` and split the data into training and testing subsets. Typically, the data is also `shuffled` into a `random order` when creating the `training` and `testing` subsets to remove any bias in the ordering of the dataset.

Let's look at the following script:

    X = correlated_data.drop(labels=['Price'], axis=1)
    y = correlated_data['Price']
    X.head()

| | RM | PTRATIO | LSTAT |
|---|---|---|---|
| 0 | 6.575 | 15.3 | 4.98 |
| 1 | 6.421 | 17.8 | 9.14 |
| 2 | 7.185 | 17.8 | 4.03 |
| 3 | 6.998 | 18.7 | 2.94 |
| 4 | 7.147 | 18.7 | 5.33 |

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train.shape, X_test.shape

((404, 3), (102, 3))

## Let's train the model

    model = LinearRegression()
    model.fit(X_train, y_train)  # LinearRegression()
    y_predict = model.predict(X_test)
    # Column 0 holds the predictions, column 1 the actual test prices
    df = pd.DataFrame(data=[y_predict, y_test])
    df.T

| | 0 | 1 |
|---|---|---|
| 0 | 27.609031 | 22.6 |
| 1 | 22.099034 | 50.0 |
| 2 | 26.529255 | 23.0 |
| 3 | 12.507986 | 8.3 |
| 4 | 22.254879 | 21.2 |
| … | … | … |
| 97 | 28.271228 | 24.7 |
| 98 | 18.467419 | 14.1 |
| 99 | 18.558070 | 18.7 |
| 100 | 24.681964 | 28.1 |
| 101 | 20.826879 | 19.8 |

102 rows × 2 columns

## Defining performance metrics

It is difficult to measure the `quality` of a given model without `quantifying` its performance over `training` and `testing`. This is typically done using some type of performance metric, whether through calculating some type of `error`, the goodness of fit, or some other useful measurement. For this project, we will calculate the `coefficient of determination`, `R2`, to quantify the model's performance. `The coefficient of determination` for a model is a useful statistic in `regression analysis`, as it often describes how “good” that model is at making predictions.

The values for `R2` typically range from `0` to `1`. For a least-squares linear model evaluated on its training data, `R2` equals the `squared correlation` between the predicted and actual values of the target variable. A model with an `R2` of `0` is no better than a model that always predicts the `mean` of the target variable, whereas a model with an `R2` of `1` `perfectly` predicts the target variable. Any value between `0` and `1` indicates what `percentage` of the variance of the target variable can be explained by the features using this model. A model can also have a negative `R2`, which indicates that it is worse than one that naively predicts the mean of the target variable.

In the code below, we use `r2_score` from `sklearn.metrics` to perform a performance calculation between `y_true` and `y_predict`, and assign the performance score to the `score` variable.

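Now we will find $R^2$, which is defined as follows:

$$SS_{t} = \frac 1n\sum_{i=1}^n(y_i-\bar{y})^2$$

$$SS_{r} = \frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

$$R^{2} = 1-\frac{SS_{r}}{SS_{t}}$$

- $SS_t$ = total sum of squares, where $\bar{y}$ is the mean of the observed values
- $SS_r$ = sum of squares of residuals, where $\hat{y}_i$ is the predicted value for sample $i$
- $R^2$ = ranges from 0 to 1, and is negative if the model is completely wrong

The factor $\frac 1n$ appears in both $SS_t$ and $SS_r$, so it cancels in the ratio.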
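To see how the score behaves, here is a small hand-rolled sketch that compares a model's squared residuals with those of a baseline that always predicts the mean. The helper name `r2` and the four target values are made up for illustration; in practice `r2_score` from `sklearn.metrics` does this job.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual squares / total squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_r = np.mean((y_true - y_pred) ** 2)         # squared residuals of the model
    ss_t = np.mean((y_true - y_true.mean()) ** 2)  # squared deviations from the mean
    return 1.0 - ss_r / ss_t

y_true = [3.0, 5.0, 7.0, 9.0]  # made-up targets with mean 6

perfect = r2(y_true, y_true)                # predictions equal the targets
mean_only = r2(y_true, [6.0] * 4)           # always predict the mean
worse = r2(y_true, [9.0, 7.0, 5.0, 3.0])    # predictions in reverse order

print(perfect, mean_only, worse)
```

The three cases give 1, 0 and a negative value respectively, which matches the interpretation of `R2` as an improvement over the mean-only baseline.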

## Regression Evaluation Metrics

Here are three common evaluation metrics for regression problems:

**Mean Absolute Error** (MAE) is the mean of the `absolute value of the errors`:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the `squared errors`:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the `mean of the squared errors`:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Comparing these metrics:

- **MAE** is the easiest to `understand`, because it's the `average error`.
- **MSE** is more popular than MAE, because MSE `"punishes"` larger errors, which tends to be useful in the real world.
- **RMSE** is even more popular than MSE, because RMSE is `interpretable in the "y" units`.

All of these are **loss functions**, because we want to minimize them.
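A quick worked example with made-up numbers shows how the three metrics react to a single large error. The values below are illustrative only.

```python
import numpy as np

# Made-up targets and predictions; the last prediction is off by 4.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 13.0])

errors = y_true - y_pred          # [ 1.  0. -1. -4.]

mae = np.mean(np.abs(errors))     # (1 + 0 + 1 + 4) / 4
mse = np.mean(errors ** 2)        # (1 + 0 + 1 + 16) / 4
rmse = np.sqrt(mse)               # back in the units of y

print("MAE: ", mae)
print("MSE: ", mse)
print("RMSE:", rmse)
```

The single error of 4 contributes 16 of the 18 squared units, so MSE and RMSE are pulled up much more strongly than MAE.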

from sklearn.metrics import r2_score

correlated_data.columns

Index(['RM', 'PTRATIO', 'LSTAT', 'Price'], dtype='object')

    score = r2_score(y_test, y_predict)
    mae = mean_absolute_error(y_test, y_predict)
    mse = mean_squared_error(y_test, y_predict)
    print('r2_score: ', score)
    print('mae: ', mae)
    print('mse: ', mse)

    r2_score: 0.48816420156925056
    mae: 4.404434993909258
    mse: 41.67799012221684

### Store feature performance

    # Running lists: one entry is appended per evaluated feature set
    total_features = []
    total_features_name = []
    selected_correlation_value = []
    r2_scores = []
    mae_value = []
    mse_value = []

    def performance_metrics(features, th, y_true, y_pred):
        score = r2_score(y_true, y_pred)
        mae = mean_absolute_error(y_true, y_pred)
        mse = mean_squared_error(y_true, y_pred)
        total_features.append(len(features) - 1)  # do not count the 'Price' column
        total_features_name.append(str(features))
        selected_correlation_value.append(th)
        r2_scores.append(score)
        mae_value.append(mae)
        mse_value.append(mse)
        metrics_dataframe = pd.DataFrame(
            data=[total_features_name, total_features, selected_correlation_value,
                  r2_scores, mae_value, mse_value],
            index=['features name', '#feature', 'corr_value', 'r2_score', 'MAE', 'MSE'])
        return metrics_dataframe.T

    performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)

| | features name | #feature | corr_value | r2_score | MAE | MSE |
|---|---|---|---|---|---|---|
| 0 | ['RM' 'PTRATIO' 'LSTAT' 'Price'] | 3 | 0.5 | 0.488164 | 4.40443 | 41.678 |

# Regression plot of the features correlated with the House Price

Let's plot the features that are correlated with the house price:

    rows = 2
    cols = 2
    fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(16, 4))
    ax[0, 0].set_title("House price with respect to RM")
    ax[0, 1].set_title("House price with respect to PTRATIO")
    ax[1, 0].set_title("House price with respect to LSTAT")
    ax[1, 1].set_title("House price with respect to PRICE")
    col = correlated_data.columns
    index = 0
    # One regression plot per column, each against the house price
    for i in range(rows):
        for j in range(cols):
            sns.regplot(x=correlated_data[col[index]], y=correlated_data['Price'], ax=ax[i][j])
            index = index + 1
    fig.tight_layout()

# Let's try the features with correlation above 0.60 to look for better accuracy

    corrmat['Price']

    CRIM      -0.388305
    ZN         0.360445
    INDUS     -0.483725
    CHAS       0.175260
    NOX       -0.427321
    RM         0.695360
    AGE       -0.376955
    DIS        0.249929
    RAD       -0.381626
    TAX       -0.468536
    PTRATIO   -0.507787
    B          0.333461
    LSTAT     -0.737663
    Price      1.000000
    Name: Price, dtype: float64

    threshold = 0.60
    corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
    corr_value

| | Corr Value |
|---|---|
| RM | 0.695360 |
| LSTAT | -0.737663 |
| Price | 1.000000 |

    correlated_data = data[corr_value.index]
    correlated_data.head()

| | RM | LSTAT | Price |
|---|---|---|---|
| 0 | 6.575 | 4.98 | 24.0 |
| 1 | 6.421 | 9.14 | 21.6 |
| 2 | 7.185 | 4.03 | 34.7 |
| 3 | 6.998 | 2.94 | 33.4 |
| 4 | 7.147 | 5.33 | 36.2 |

The following function predicts `y` from the `corr_data`. It returns the `predicted` values for `y` on the test split.

    def get_y_predict(corr_data):
        X = corr_data.drop(labels=['Price'], axis=1)
        y = corr_data['Price']
        # Same test_size and random_state as before, so the earlier y_test still lines up
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
        model = LinearRegression()
        model.fit(X_train, y_train)
        y_predict = model.predict(X_test)
        return y_predict

    y_predict = get_y_predict(correlated_data)
    performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)

| | features name | #feature | corr_value | r2_score | MAE | MSE |
|---|---|---|---|---|---|---|
| 0 | ['RM' 'PTRATIO' 'LSTAT' 'Price'] | 3 | 0.5 | 0.488164 | 4.40443 | 41.678 |
| 1 | ['RM' 'LSTAT' 'Price'] | 2 | 0.6 | 0.540908 | 4.14244 | 37.3831 |

# Let's try the features with correlation above 0.70 to look for better accuracy

    corrmat['Price']

    CRIM      -0.388305
    ZN         0.360445
    INDUS     -0.483725
    CHAS       0.175260
    NOX       -0.427321
    RM         0.695360
    AGE       -0.376955
    DIS        0.249929
    RAD       -0.381626
    TAX       -0.468536
    PTRATIO   -0.507787
    B          0.333461
    LSTAT     -0.737663
    Price      1.000000
    Name: Price, dtype: float64

    threshold = 0.70
    corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
    corr_value

| | Corr Value |
|---|---|
| LSTAT | -0.737663 |
| Price | 1.000000 |

    correlated_data = data[corr_value.index]
    correlated_data.head()

| | LSTAT | Price |
|---|---|---|
| 0 | 4.98 | 24.0 |
| 1 | 9.14 | 21.6 |
| 2 | 4.03 | 34.7 |
| 3 | 2.94 | 33.4 |
| 4 | 5.33 | 36.2 |

    y_predict = get_y_predict(correlated_data)
    performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)

| | features name | #feature | corr_value | r2_score | MAE | MSE |
|---|---|---|---|---|---|---|
| 0 | ['RM' 'PTRATIO' 'LSTAT' 'Price'] | 3 | 0.5 | 0.488164 | 4.40443 | 41.678 |
| 1 | ['RM' 'LSTAT' 'Price'] | 2 | 0.6 | 0.540908 | 4.14244 | 37.3831 |
| 2 | ['LSTAT' 'Price'] | 1 | 0.7 | 0.430957 | 4.86401 | 46.3363 |

# Let's go ahead and select only the RM feature

    correlated_data = data[['RM', 'Price']]
    correlated_data.head()

| | RM | Price |
|---|---|---|
| 0 | 6.575 | 24.0 |
| 1 | 6.421 | 21.6 |
| 2 | 7.185 | 34.7 |
| 3 | 6.998 | 33.4 |
| 4 | 7.147 | 36.2 |

    y_predict = get_y_predict(correlated_data)
    performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)

| | features name | #feature | corr_value | r2_score | MAE | MSE |
|---|---|---|---|---|---|---|
| 0 | ['RM' 'PTRATIO' 'LSTAT' 'Price'] | 3 | 0.5 | 0.488164 | 4.40443 | 41.678 |
| 1 | ['RM' 'LSTAT' 'Price'] | 2 | 0.6 | 0.540908 | 4.14244 | 37.3831 |
| 2 | ['LSTAT' 'Price'] | 1 | 0.7 | 0.430957 | 4.86401 | 46.3363 |
| 3 | ['RM' 'Price'] | 1 | 0.7 | 0.423944 | 4.32474 | 46.9074 |

# Let's try the features with correlation above 0.40 to look for better accuracy

    threshold = 0.40
    corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
    corr_value

| | Corr Value |
|---|---|
| INDUS | -0.483725 |
| NOX | -0.427321 |
| RM | 0.695360 |
| TAX | -0.468536 |
| PTRATIO | -0.507787 |
| LSTAT | -0.737663 |
| Price | 1.000000 |

    correlated_data = data[corr_value.index]
    correlated_data.head()

| | INDUS | NOX | RM | TAX | PTRATIO | LSTAT | Price |
|---|---|---|---|---|---|---|---|
| 0 | 2.31 | 0.538 | 6.575 | 296.0 | 15.3 | 4.98 | 24.0 |
| 1 | 7.07 | 0.469 | 6.421 | 242.0 | 17.8 | 9.14 | 21.6 |
| 2 | 7.07 | 0.469 | 7.185 | 242.0 | 17.8 | 4.03 | 34.7 |
| 3 | 2.18 | 0.458 | 6.998 | 222.0 | 18.7 | 2.94 | 33.4 |
| 4 | 2.18 | 0.458 | 7.147 | 222.0 | 18.7 | 5.33 | 36.2 |

    y_predict = get_y_predict(correlated_data)
    performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)

| | features name | #feature | corr_value | r2_score | MAE | MSE |
|---|---|---|---|---|---|---|
| 0 | ['RM' 'PTRATIO' 'LSTAT' 'Price'] | 3 | 0.5 | 0.488164 | 4.40443 | 41.678 |
| 1 | ['RM' 'LSTAT' 'Price'] | 2 | 0.6 | 0.540908 | 4.14244 | 37.3831 |
| 2 | ['LSTAT' 'Price'] | 1 | 0.7 | 0.430957 | 4.86401 | 46.3363 |
| 3 | ['RM' 'Price'] | 1 | 0.7 | 0.423944 | 4.32474 | 46.9074 |
| 4 | ['INDUS' 'NOX' 'RM' 'TAX' 'PTRATIO' 'LSTAT' 'P… | 6 | 0.4 | 0.476203 | 4.3945 | 42.6519 |

### Now let's go ahead and understand what Normalization and Standardization are

#### Standardization

`Standardization` of data sets is a common requirement for many machine learning `estimators` implemented in `scikit-learn`; they might behave `badly` if the individual features do not more or less look like `standard normally distributed data`: `Gaussian` with `zero mean` and `unit variance`.
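As a minimal sketch, scikit-learn's `StandardScaler` performs this rescaling column by column. The small array below is made up for illustration; its two columns live on very different scales before scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # subtract each column's mean, divide by its std

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

In a train/test setting, the scaler would normally be fitted on the training data only and then applied to the test data with `scaler.transform`.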

#### Normalization

`Normalization` is the process of scaling individual samples to have `unit norm`. This process can be useful if you plan to use a quadratic form such as the dot-product or any other `kernel` to quantify the similarity of any pair of samples.

This assumption is the base of the `Vector Space Model` often used in `text classification` and `clustering contexts`.
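Unlike standardization, normalization works row by row. Here is a minimal sketch with scikit-learn's `Normalizer` on three made-up samples, using the L2 norm.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Three made-up samples (rows); each row is rescaled independently.
X = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [5.0, 12.0]])

X_normalized = Normalizer(norm="l2").fit_transform(X)
row_norms = np.linalg.norm(X_normalized, axis=1)

print(X_normalized)
print(row_norms)  # every row now has length 1
```

The first row `[3, 4]` has length 5, so it becomes `[0.6, 0.8]`; the direction of each sample is kept while its length is set to 1.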

| Name | Sklearn_class |
|---|---|
| Standard scaler | `StandardScaler` |
| MinMax scaler | `MinMaxScaler` |
| MaxAbs scaler | `MaxAbsScaler` |
| Robust scaler | `RobustScaler` |
| Quantile transformer (normal) | `QuantileTransformer(output_distribution='normal')` |
| Quantile transformer (uniform) | `QuantileTransformer(output_distribution='uniform')` |
| Power transformer (Yeo-Johnson) | `PowerTransformer(method='yeo-johnson')` |
| Normalizer | `Normalizer` |

    # Note: the normalize parameter was removed in scikit-learn 1.2, so this needs an older release
    model = LinearRegression(normalize=True)
    model.fit(X_train, y_train)  # LinearRegression(normalize=True)
    y_predict = model.predict(X_test)
    r2_score(y_test, y_predict)

0.48816420156925067
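The score is the same as without scaling, and that is expected: for ordinary least squares with an intercept, rescaling the features changes the coefficients but not the predictions. The sketch below illustrates this on synthetic data (not the Boston set), using a `StandardScaler` inside a pipeline as a comparable setup for scikit-learn versions where `LinearRegression` no longer accepts `normalize`. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y = X @ np.array([2.0, -0.5, 0.03]) + rng.normal(size=200)

# Simple hold-out split: first 160 rows for training, the rest for testing.
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

plain = LinearRegression().fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)

r2_plain = r2_score(y_test, plain.predict(X_test))
r2_scaled = r2_score(y_test, scaled.predict(X_test))

print(r2_plain, r2_scaled)
```

Both scores agree up to floating-point rounding. Scaling starts to matter for models that are sensitive to feature magnitudes, such as regularized or distance-based ones.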


### Plotting Learning Curves

Now we will plot the learning curves:

    from sklearn.model_selection import learning_curve, ShuffleSplit

    def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                            n_jobs=None, train_sizes=np.linspace(.1, 1.0, 10)):
        # Note: ylim is accepted but never applied to the axes in this version
        plt.figure()
        plt.title(title)
        plt.xlabel("Training examples")
        plt.ylabel("Score")
        train_sizes, train_scores, test_scores = learning_curve(
            estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
        train_scores_mean = np.mean(train_scores, axis=1)
        train_scores_std = np.std(train_scores, axis=1)
        test_scores_mean = np.mean(test_scores, axis=1)
        test_scores_std = np.std(test_scores, axis=1)
        plt.grid()

        # Shaded bands show one standard deviation around the mean score
        plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                         train_scores_mean + train_scores_std, alpha=0.1, color="r")
        plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                         test_scores_mean + test_scores_std, alpha=0.1, color="g")
        plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
        plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")

        plt.legend(loc="best")
        return plt

    X = correlated_data.drop(labels=['Price'], axis=1)
    y = correlated_data['Price']

    title = "Learning Curves (Linear Regression) " + str(X.columns.values)

    # 100 random 80/20 splits for cross-validation
    cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

    estimator = LinearRegression()
    plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=-1)

    plt.show()
