Linear Regression with Python | Machine learlearning | KGP Talkie

Published by KGP Talkie on 7 August 20207 August 2020

What is Linear Regression?

You are a real estate agent and you want to predict the house price. It would be great if you can make some kind of automated system which predict price of a house based on various input which is known as feature.

Supervised Machine learning algorithms needs some data to train its model before making a prediction. For that we have a Boston Dataset.

Where can Linear Regression be used?

It is a very powerful technique and can be used to understand the factors that influence profitability. It can be used to forecast sales in the coming months by analyzing the sales data for previous months. It can also be used to gain various insights about customer behaviour.

What is Regression?

Let's first understand what exactly Regression means it is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables known as independent variables.
Linear Regression is a statistical technique where based on a set of independent variable(s) a dependent variable is predicted.

Regression Examples

Stock prediction

We can predict the price of stock depends on depedent variable,x. let's say recent history of stock price,news events.

Tweet popularity

We can also estimate number of people will retweet for your tweet in tewitter based number of followers,popularity of hashtag.

In real estate

As we discussed earlier,We can also predict the house prices and land prices in real estate.

Regression Types

It is of two types: simple linear regression and multiple linear regression.
Simple linear regression: It is characterized by an variable quantity.

Simple Linear Regression

yi=β0+β1Xi+εi

y = dependent variable
β0 = population of intercept
βi = population of co-efficient
x = independent variable
εi = Random error

Multiple Linear Regression

It(as the name suggests) is characterized by multiple independent variables (more than 1). While you discover the simplest fit line, you'll be able to adjust a polynomial or regression toward the mean. And these are called polynomial or regression toward the mean.

Assessing the performance of the model

How do we determine the best fit line?

The line for which the the error between the predicted values and the observed values is minimum is called the best fit line or the regression line. These errors are also called as residuals. The residuals can be visualized by the vertical lines from the observed data value to the regression line.

Bias-Variance tradeoff

Bias are the simplifying assumptions made by a model to make the target function easier to learn. Variance is the amount that the estimate of the target function will change if different training data was used. The goal of any supervised machine learning algorithm is to achieve low bias and low variance. In turn the algorithm should achieve good prediction performance.

How to determine error

Gradient descent algorithm

Gradient descent is the backbone of an machine learning algorithm.To estimate the predicted value for,Y we will start with random value for, θ

then derive cost using the above equation which stands for Mean Squared Error(MSE).Remember we will try to get the minimum value of cost function that we will get by derivation of cost function .

Gradient Descent Algorithm to reduce the cost function

You might not end up in global minimum

Implimentation with sklearn

scikit-learn

Machine Learning in Python
Simple and efficient tools for data mining and data analysis
Accessible to everybody, and reusable in various contexts
Built on NumPy, SciPy, and matplotlib
Open source, commercially usable - BSD license

Learn more here: https://scikit-learn.org/stable/

Image Source: https://cdn-images-1.medium.com/max/2400/1*2NR51X0FDjLB13u4WdYc4g.png

Let's discuss something about training a ML model, this model generally will try to predict one variable based on all the others. To verify how well this model works, we need a second data set, the test set. We use the model we learned from the training data and see how well it predicts the variable in question for the training set. When given a data set for which you want to use Machine Learning, typically you would divide it randomly into 2 sets. One will be used for training, the other for testing.

Training and testing splitting

Lets get started

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

boston = load_boston()
type(boston)
sklearn.utils.Bunch
boston.keys()

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

:Number of Instances: 506

:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

boston.feature_names

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

boston.target[: 5]

array([24. , 21.6, 34.7, 33.4, 36.2])

data = boston.data
type(data)
numpy.ndarray
data.shape

(506, 13)

DataFrame()

A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

data = pd.DataFrame(data = data, columns= boston.feature_names)
data.head()

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33

data['Price'] = boston.target
data.head()

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	Price
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2

Understand your data

data.describe()

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	Price
count	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000	506.000000
mean	3.613524	11.363636	11.136779	0.069170	0.554695	6.284634	68.574901	3.795043	9.549407	408.237154	18.455534	356.674032	12.653063	22.532806
std	8.601545	23.322453	6.860353	0.253994	0.115878	0.702617	28.148861	2.105710	8.707259	168.537116	2.164946	91.294864	7.141062	9.197104
min	0.006320	0.000000	0.460000	0.000000	0.385000	3.561000	2.900000	1.129600	1.000000	187.000000	12.600000	0.320000	1.730000	5.000000
25%	0.082045	0.000000	5.190000	0.000000	0.449000	5.885500	45.025000	2.100175	4.000000	279.000000	17.400000	375.377500	6.950000	17.025000
50%	0.256510	0.000000	9.690000	0.000000	0.538000	6.208500	77.500000	3.207450	5.000000	330.000000	19.050000	391.440000	11.360000	21.200000
75%	3.677083	12.500000	18.100000	0.000000	0.624000	6.623500	94.075000	5.188425	24.000000	666.000000	20.200000	396.225000	16.955000	25.000000
max	88.976200	100.000000	27.740000	1.000000	0.871000	8.780000	100.000000	12.126500	24.000000	711.000000	22.000000	396.900000	37.970000	50.000000

data.info()

Pandas dataframe.info() function is used to get a concise summary of the dataframe. It comes really handy when doing exploratory analysis of the data. To get a quick overview of the dataset we use the dataframe.info() function.

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  Price    506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB

data.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
Price      0
dtype: int64

Data Visualization

We will start by creating a scatterplot matrix that will allow us to visualize the pair-wise relationships and correlations between the different features.

It is also quite useful to have a quick overview of how the data is distributed and wheter it cointains or not outliers.

sns.pairplot(data)
plt.show()

rows = 2
cols = 7
fig, ax = plt.subplots(nrows= rows, ncols= cols, figsize = (16,4))
col = data.columns
index = 0
for i in range(rows):
    for j in range(cols):
        sns.distplot(data[col[index]], ax = ax[i][j])
        index = index + 1
plt.tight_layout()
plt.show()

We are going to create now a correlation matrix to quantify and summarize the relationships between the variables.

This correlation matrix is closely related witn covariance matrix, in fact it is a rescaled version of the covariance matrix, computed from standardize features.

It is a square matrix (with the same number of columns and rows) that contains the Person’s r correlation coefficient.

corrmat = data.corr()
corrmat

	CRIM	ZN	INDUS	CHAS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	Price
CRIM	1.000000	-0.200469	0.406583	-0.055892	0.420972	-0.219247	0.352734	-0.379670	0.625505	0.582764	0.289946	-0.385064	0.455621	-0.388305
ZN	-0.200469	1.000000	-0.533828	-0.042697	-0.516604	0.311991	-0.569537	0.664408	-0.311948	-0.314563	-0.391679	0.175520	-0.412995	0.360445
INDUS	0.406583	-0.533828	1.000000	0.062938	0.763651	-0.391676	0.644779	-0.708027	0.595129	0.720760	0.383248	-0.356977	0.603800	-0.483725
CHAS	-0.055892	-0.042697	0.062938	1.000000	0.091203	0.091251	0.086518	-0.099176	-0.007368	-0.035587	-0.121515	0.048788	-0.053929	0.175260
NOX	0.420972	-0.516604	0.763651	0.091203	1.000000	-0.302188	0.731470	-0.769230	0.611441	0.668023	0.188933	-0.380051	0.590879	-0.427321
RM	-0.219247	0.311991	-0.391676	0.091251	-0.302188	1.000000	-0.240265	0.205246	-0.209847	-0.292048	-0.355501	0.128069	-0.613808	0.695360
AGE	0.352734	-0.569537	0.644779	0.086518	0.731470	-0.240265	1.000000	-0.747881	0.456022	0.506456	0.261515	-0.273534	0.602339	-0.376955
DIS	-0.379670	0.664408	-0.708027	-0.099176	-0.769230	0.205246	-0.747881	1.000000	-0.494588	-0.534432	-0.232471	0.291512	-0.496996	0.249929
RAD	0.625505	-0.311948	0.595129	-0.007368	0.611441	-0.209847	0.456022	-0.494588	1.000000	0.910228	0.464741	-0.444413	0.488676	-0.381626
TAX	0.582764	-0.314563	0.720760	-0.035587	0.668023	-0.292048	0.506456	-0.534432	0.910228	1.000000	0.460853	-0.441808	0.543993	-0.468536
PTRATIO	0.289946	-0.391679	0.383248	-0.121515	0.188933	-0.355501	0.261515	-0.232471	0.464741	0.460853	1.000000	-0.177383	0.374044	-0.507787
B	-0.385064	0.175520	-0.356977	0.048788	-0.380051	0.128069	-0.273534	0.291512	-0.444413	-0.441808	-0.177383	1.000000	-0.366087	0.333461
LSTAT	0.455621	-0.412995	0.603800	-0.053929	0.590879	-0.613808	0.602339	-0.496996	0.488676	0.543993	0.374044	-0.366087	1.000000	-0.737663
Price	-0.388305	0.360445	-0.483725	0.175260	-0.427321	0.695360	-0.376955	0.249929	-0.381626	-0.468536	-0.507787	0.333461	-0.737663	1.000000

Heatmap ( )

A heatmap is a two-dimensional graphical representation of data where the individual values that are contained in a matrix are represented as colors. The seaborn python package allows the creation of annotated heatmaps which can be tweaked using Matplotlib tools as per the creator's requirement.
Now try look into the following script:

fig, ax = plt.subplots(figsize = (18, 10))
sns.heatmap(corrmat, annot = True, annot_kws={'size': 12})
plt.show()

corrmat.index.values

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT', 'Price'], dtype=object)

def getCorrelatedFeature(corrdata, threshold):
    feature = []
    value = []
    
    for i, index in enumerate(corrdata.index):
        if abs(corrdata[index])> threshold:
            feature.append(index)
            value.append(corrdata[index])
            
    df = pd.DataFrame(data = value, index = feature, columns=['Corr Value'])
    return df

threshold = 0.50
corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
corr_value

	Corr Value
RM	0.695360
PTRATIO	-0.507787
LSTAT	-0.737663
Price	1.000000

corr_value.index.values

array(['RM', 'PTRATIO', 'LSTAT', 'Price'], dtype=object)

correlated_data = data[corr_value.index]
correlated_data.head()

	RM	PTRATIO	LSTAT	Price
0	6.575	15.3	4.98	24.0
1	6.421	17.8	9.14	21.6
2	7.185	17.8	4.03	34.7
3	6.998	18.7	2.94	33.4
4	7.147	18.7	5.33	36.2

Pairplot and Corrmat of correlated data

A pairplot plot a pairwise relationships in a dataset. Let's look at the pair plot of correlated data.

sns.pairplot(correlated_data)
plt.tight_layout()

sns.heatmap(correlated_data.corr(), annot=True, annot_kws={'size': 12},linewidth =0)
plt.show()

Shuffle and Split Data

we will take the Boston housing dataset and split the data into training and testing subsets. Typically, the data is also shuffled into a random order when creating the training and testing subsets to remove any bias in the ordering of the dataset.
Let's try to observe the following script:

X = correlated_data.drop(labels=['Price'], axis = 1)
y = correlated_data['Price']
X.head()

	RM	PTRATIO	LSTAT
0	6.575	15.3	4.98
1	6.421	17.8	9.14
2	7.185	17.8	4.03
3	6.998	18.7	2.94
4	7.147	18.7	5.33

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
X_train.shape, X_test.shape

((404, 3), (102, 3))

Lets train the mode

model = LinearRegression()
model.fit(X_train, y_train)
LinearRegression()
y_predict = model.predict(X_test)
df = pd.DataFrame(data = [y_predict, y_test])
df.T

	0	1
0	27.609031	22.6
1	22.099034	50.0
2	26.529255	23.0
3	12.507986	8.3
4	22.254879	21.2
...	...	...
97	28.271228	24.7
98	18.467419	14.1
99	18.558070	18.7
100	24.681964	28.1
101	20.826879	19.8

102 rows × 2 columns

Defining performance metrics

It is difficult to measure the quality of a given model without quantifying its performance over training and testing. This is typically done using some type of performance metric, whether it is through calculating some type of error, the goodness of fit, or some other useful measurement. For this project, you will be calculating the coefficient of determination, R2, to quantify your model's performance. The coefficient of determination for a model is a useful statistic in regression analysis, as it often describes how "good" that model is at making predictions.

The values for R2 range from 0 to 1, which captures the percentage of squared correlation between the predicted and actual values of the target variable. A model with an R2 of 0 always fails to predict the target variable, whereas a model with an R2 of 1 perfectly predicts the target variable. Any value between 0 and 1 indicates what percentage of the target variable, using this model, can be explained by the features. A model can be given a negative R2 as well, which indicates that the model is no better than one that naively predicts the mean of the target variable.

For the performance_metric function in the code cell below, you will need to implement the following:

Use r2_score from sklearn.metrics to perform a performance calculation between y_true and y_predict. Assign the performance score to the score variable.

Now we will find $R^2$ which is defined as follows :
$$SS_{t} = {\frac 1n\sum_{i=1}^n(y_i-\hat{y})^2}$$
$$SS_{r} = {\frac 1n\sum_{i=1}^n(y_i-\hat{y}^2}$$
$$R^{2} = 1-\frac{SS}{SS}$$ SSt = total sum of squares
SSr = total sum of squares of residuals
R2 = range from 0 to 1 and also negative if model is completely wrong.

Regression Evaluation Metrics

Here are three common evaluation metrics for regression problems:

Mean Absolute Error (MAE) is the mean of the absolute value of the errors: $$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$

Mean Squared Error (MSE) is the mean of the squared errors: $${\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors: $$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$

Comparing these metrics:

MAE is the easiest to understand, because it's the average error.
MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.

All of these are loss functions, because we want to minimize them.

from sklearn.metrics import r2_score

correlated_data.columns

Index(['RM', 'PTRATIO', 'LSTAT', 'Price'], dtype='object')

score = r2_score(y_test, y_predict)
mae = mean_absolute_error(y_test, y_predict)
mse = mean_squared_error(y_test, y_predict)
print('r2_score: ', score)
print('mae: ', mae)
print('mse: ', mse)

r2_score:  0.48816420156925056
mae:  4.404434993909258
mse:  41.67799012221684

Store feature performance

total_features = []
total_features_name = []
selected_correlation_value = []
r2_scores = []
mae_value = []
mse_value = []

def performance_metrics(features, th, y_true, y_pred):
    score = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    total_features.append(len(features)-1)
    total_features_name.append(str(features))
    selected_correlation_value.append(th)
    r2_scores.append(score)
    mae_value.append(mae)
    mse_value.append(mse)
    
    metrics_dataframe = pd.DataFrame(data= [total_features_name, total_features, selected_correlation_value, r2_scores, mae_value, mse_value],
                                    index = ['features name', '#feature', 'corr_value', 'r2_score', 'MAE', 'MSE'])
    return metrics_dataframe.T

performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)

	features name	#feature	corr_value	r2_score	MAE	MSE
0	['RM' 'PTRATIO' 'LSTAT' 'Price']	3	0.5	0.488164	4.40443	41.678

Regression plot of the features correlated with the House Price

Let's try to plot the features in correlation the house price:

rows = 2
cols = 2
fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize = (16, 4))
ax[0, 0].set_title("House price with respect  to RM")
ax[0, 1].set_title("House price with respect to PTRATIO")
ax[1, 0].set_title("House price with respect to LSTAT")
ax[1, 1].set_title("House price with respect to PRICE")
col = correlated_data.columns
index = 0

for i in range(rows):
    for j in range(cols):
        sns.regplot(x = correlated_data[col[index]], y = correlated_data['Price'], ax = ax[i][j])
        index = index + 1
fig.tight_layout()

Let's find out other combination of columns to get better accuracy with >60%

corrmat['Price']

CRIM      -0.388305
ZN         0.360445
INDUS     -0.483725
CHAS       0.175260
NOX       -0.427321
RM         0.695360
AGE       -0.376955
DIS        0.249929
RAD       -0.381626
TAX       -0.468536
PTRATIO   -0.507787
B          0.333461
LSTAT     -0.737663
Price      1.000000
Name: Price, dtype: float64

threshold = 0.60
corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
corr_value

	Corr Value
RM	0.695360
LSTAT	-0.737663
Price	1.000000

correlated_data = data[corr_value.index]
correlated_data.head()

	RM	LSTAT	Price
0	6.575	4.98	24.0
1	6.421	9.14	21.6
2	7.185	4.03	34.7
3	6.998	2.94	33.4
4	7.147	5.33	36.2

Prediction of y from the corr_data. This function return a predicted value for y.

def get_y_predict(corr_data):
    X = corr_data.drop(labels = ['Price'], axis = 1)
    y = corr_data['Price']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    return y_predict

y_predict = get_y_predict(correlated_data)
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)

	features name	#feature	corr_value	r2_score	MAE	MSE
0	['RM' 'PTRATIO' 'LSTAT' 'Price']	3	0.5	0.488164	4.40443	41.678
1	['RM' 'LSTAT' 'Price']	2	0.6	0.540908	4.14244	37.3831

Let's find out other combination of columns to get better accuracy > 70%

corrmat['Price']

CRIM      -0.388305
ZN         0.360445
INDUS     -0.483725
CHAS       0.175260
NOX       -0.427321
RM         0.695360
AGE       -0.376955
DIS        0.249929
RAD       -0.381626
TAX       -0.468536
PTRATIO   -0.507787
B          0.333461
LSTAT     -0.737663
Price      1.000000
Name: Price, dtype: float64

threshold = 0.70
corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
corr_value

	Corr Value
LSTAT	-0.737663
Price	1.000000

correlated_data = data[corr_value.index]
correlated_data.head()

	LSTAT	Price
0	4.98	24.0
1	9.14	21.6
2	4.03	34.7
3	2.94	33.4
4	5.33	36.2

y_predict = get_y_predict(correlated_data)
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)

	features name	#feature	corr_value	r2_score	MAE	MSE
0	['RM' 'PTRATIO' 'LSTAT' 'Price']	3	0.5	0.488164	4.40443	41.678
1	['RM' 'LSTAT' 'Price']	2	0.6	0.540908	4.14244	37.3831
2	['LSTAT' 'Price']	1	0.7	0.430957	4.86401	46.3363

Let's go ahead and select only RM feature

correlated_data = data[['RM', 'Price']]
correlated_data.head()

	RM	Price
0	6.575	24.0
1	6.421	21.6
2	7.185	34.7
3	6.998	33.4
4	7.147	36.2

y_predict = get_y_predict(correlated_data)
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)

	features name	#feature	corr_value	r2_score	MAE	MSE
0	['RM' 'PTRATIO' 'LSTAT' 'Price']	3	0.5	0.488164	4.40443	41.678
1	['RM' 'LSTAT' 'Price']	2	0.6	0.540908	4.14244	37.3831
2	['LSTAT' 'Price']	1	0.7	0.430957	4.86401	46.3363
3	['RM' 'Price']	1	0.7	0.423944	4.32474	46.9074

Let's find out other combination of columns to get better accuracy > 40%

threshold = 0.40
corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
corr_value

	Corr Value
INDUS	-0.483725
NOX	-0.427321
RM	0.695360
TAX	-0.468536
PTRATIO	-0.507787
LSTAT	-0.737663
Price	1.000000

correlated_data = data[corr_value.index]
correlated_data.head()

	INDUS	NOX	RM	TAX	PTRATIO	LSTAT	Price
0	2.31	0.538	6.575	296.0	15.3	4.98	24.0
1	7.07	0.469	6.421	242.0	17.8	9.14	21.6
2	7.07	0.469	7.185	242.0	17.8	4.03	34.7
3	2.18	0.458	6.998	222.0	18.7	2.94	33.4
4	2.18	0.458	7.147	222.0	18.7	5.33	36.2

y_predict = get_y_predict(correlated_data)
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)

	features name	#feature	corr_value	r2_score	MAE	MSE
0	['RM' 'PTRATIO' 'LSTAT' 'Price']	3	0.5	0.488164	4.40443	41.678
1	['RM' 'LSTAT' 'Price']	2	0.6	0.540908	4.14244	37.3831
2	['LSTAT' 'Price']	1	0.7	0.430957	4.86401	46.3363
3	['RM' 'Price']	1	0.7	0.423944	4.32474	46.9074
4	['INDUS' 'NOX' 'RM' 'TAX' 'PTRATIO' 'LSTAT' 'P...	6	0.4	0.476203	4.3945	42.6519

Now lets go ahead and understand what is Normalization and Standardization

Standardization

Standardization of data sets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

Normalization

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the base of the Vector Space Model often used in text classification and clustering contexts.

Name	Sklearn_class
Standard scaler	Standard scaler
MinMaxScaler	MinMax Scaler
MaxAbs Scaler	MaxAbs Scaler
Robust scaler	Robust scaler
Quantile Transformer_Normal	Quantile Transformer(output_distribution ='normal')
Quantile Transformer_Uniform	Quantile Transformer(output_distribution = 'uniform')
PowerTransformer-Yeo-Johnson	PowerTransformer(method = 'yeo-johnson')
Normalizer	Normalizer

model = LinearRegression(normalize=True)
model.fit(X_train, y_train)
LinearRegression(normalize=True)
y_predict = model.predict(X_test)
r2_score(y_test, y_predict)

0.48816420156925067

Defining performance metrics

Plotting Learning Curves

Now we will try to plot the Learning curves:

from sklearn.model_selection import learning_curve, ShuffleSplit

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=None, train_sizes=np.linspace(.1, 1.0, 10)):
    
    plt.figure()
    plt.title(title)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

X = correlated_data.drop(labels = ['Price'], axis = 1)
y = correlated_data['Price']

title = "Learning Curves (Linear Regression) " + str(X.columns.values)

cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)

estimator = LinearRegression()
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=-1)
plt.show()

Linear Regression with Python | Machine learlearning | KGP Talkie

What is Linear Regression?

Where can Linear Regression be used?

What is Regression?

Regression Examples

Stock prediction

Tweet popularity

In real estate

Regression Types

Simple Linear Regression

Multiple Linear Regression

Assessing the performance of the model

How do we determine the best fit line?

Bias-Variance tradeoff

How to determine error

Gradient descent algorithm

Gradient Descent Algorithm to reduce the cost function

You might not end up in global minimum

Implimentation with sklearn

scikit-learn

Training and testing splitting

Lets get started

DataFrame()

Understand your data

data.info()

Data Visualization

Heatmap ( )

Pairplot and Corrmat of correlated data

Shuffle and Split Data

Lets train the mode

Defining performance metrics

Regression Evaluation Metrics

Store feature performance

Regression plot of the features correlated with the House Price

Let's find out other combination of columns to get better accuracy with >60%

Let's find out other combination of columns to get better accuracy > 70%

Let's go ahead and select only RM feature

Let's find out other combination of columns to get better accuracy > 40%

Now lets go ahead and understand what is Normalization and Standardization

Standardization

Normalization

Defining performance metrics

Plotting Learning Curves

2 Comments

Leave a Reply Cancel reply

Related Posts

Machine Learning

How to Become a Successful Machine Learning Engineer

Machine Learning

Interview Questions and Answers on TF-IDF in NLP and Machine Learning

Machine Learning

Top 10 Interview Questions and Answers for MLOps Engineers