Linear Regression with Python | Machine Learning | KGP Talkie
What is Linear Regression?
You are a real estate agent and you want to predict house prices. It would be great if you could build some kind of automated system that predicts the price of a house from various inputs, known as features. Supervised machine learning algorithms need data to train a model before they can make predictions. For that we will use the Boston housing dataset.
Where can Linear Regression be used?
It is a very powerful technique and can be used to understand the factors that influence profitability. It can be used to forecast sales in the coming months by analyzing the sales data for previous months. It can also be used to gain various insights about customer behaviour.
What is Regression?
Let's first understand what exactly regression means. It is a statistical method, used in finance, investing, and other disciplines, that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables, known as independent variables. Linear regression is a statistical technique in which a dependent variable is predicted from a set of independent variables.
Regression Examples
Stock prediction
We can predict the price of a stock based on independent variables x, for example the recent history of the stock price and news events.
Tweet popularity
We can also estimate the number of people who will retweet your tweet on Twitter, based on the number of followers and the popularity of the hashtag.
In real estate
As we discussed earlier, we can also predict house prices and land prices in real estate.
Regression Types
It is of two types: simple linear regression and multiple linear regression.
Simple linear regression: it is characterized by a single independent variable.
Simple Linear Regression
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
y = dependent variable
β0 = population intercept
β1 = population slope coefficient
x = independent variable
εi = random error
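To make the notation concrete, here is a minimal sketch (with a small made-up dataset) that estimates β0 and β1 by ordinary least squares using NumPy; the numbers are purely illustrative:

```python
import numpy as np

# Hypothetical toy data: x = house size (1000 sq. ft.), y = price ($1000s)
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([150., 200., 240., 300., 330.])

# Closed-form ordinary least squares estimates for y = b0 + b1*x
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
print("prediction for x = 2.2:", b0 + b1 * 2.2)
```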
Multiple Linear Regression
It (as the name suggests) is characterized by multiple independent variables (more than one). While finding the best fit line, you can also fit a polynomial or curvilinear relationship rather than a straight line; such models are called polynomial or curvilinear regression.
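Written out, the multiple linear regression model simply adds a coefficient for each of the p independent variables:

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i$$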
Assessing the performance of the model
How do we determine the best fit line?
The line for which the error between the predicted values and the observed values is minimum is called the best fit line, or the regression line. These errors are also called residuals. The residuals can be visualized by the vertical lines from each observed data point to the regression line.
Bias-Variance tradeoff
Bias refers to the simplifying assumptions made by a model to make the target function easier to learn. Variance is the amount by which the estimate of the target function would change if different training data were used. The goal of any supervised machine learning algorithm is to achieve low bias and low variance; in turn, the algorithm should achieve good prediction performance.
How to determine error
Gradient descent algorithm
Gradient descent is the backbone of many machine learning algorithms. To estimate the predicted value of Y, we start with a random value for the parameter θ and compute the cost, measured here as the Mean Squared Error (MSE). We then repeatedly update θ in the direction given by the derivative (gradient) of the cost function, which drives the cost toward its minimum.
Gradient Descent Algorithm to reduce the cost function
You might not end up in global minimum
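As a minimal sketch of the idea, the following code minimizes the MSE cost for a simple linear model by repeatedly stepping the parameters in the direction of the negative gradient. The toy data and learning rate are made up for illustration, and this is not the solver scikit-learn uses later (which computes the least-squares solution directly):

```python
import numpy as np

# Hypothetical toy data for y ≈ theta0 + theta1 * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.1, 10.8])

theta0, theta1 = 0.0, 0.0   # initial parameter values
lr = 0.01                   # learning rate (step size)
n = len(x)

for _ in range(5000):
    y_hat = theta0 + theta1 * x
    error = y_hat - y
    # Gradients of the MSE cost J = (1/n) * sum((y_hat - y)^2)
    grad0 = (2.0 / n) * error.sum()
    grad1 = (2.0 / n) * (error * x).sum()
    theta0 -= lr * grad0
    theta1 -= lr * grad1

print(f"theta0 = {theta0:.3f}, theta1 = {theta1:.3f}")  # close to the least-squares fit
```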
Implementation with sklearn
scikit-learn
- Machine Learning in Python
- Simple and efficient tools for data mining and data analysis
- Accessible to everybody, and reusable in various contexts
- Built on NumPy, SciPy, and matplotlib
- Open source, commercially usable - BSD license
Learn more here: https://scikit-learn.org/stable/
Image Source: https://cdn-images-1.medium.com/max/2400/1*2NR51X0FDjLB13u4WdYc4g.png
Let's discuss something about training a ML model. The model will generally try to predict one variable based on all the others. To verify how well the model works, we need a second data set, the test set. We use the model learned from the training data and see how well it predicts the variable in question for the test set. When given a data set on which you want to use machine learning, you typically divide it randomly into two sets: one used for training, the other for testing.
Training and testing splitting
Let's get started
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston  # note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
boston = load_boston()
type(boston)
sklearn.utils.Bunch
boston.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
print(boston.DESCR)
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
    - CRIM     per capita crime rate by town
    - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    - INDUS    proportion of non-retail business acres per town
    - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    - NOX      nitric oxides concentration (parts per 10 million)
    - RM       average number of rooms per dwelling
    - AGE      proportion of owner-occupied units built prior to 1940
    - DIS      weighted distances to five Boston employment centres
    - RAD      index of accessibility to radial highways
    - TAX      full-value property-tax rate per $10,000
    - PTRATIO  pupil-teacher ratio by town
    - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    - LSTAT    % lower status of the population
    - MEDV     Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset. https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression problems.
boston.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
boston.target[: 5]
array([24. , 21.6, 34.7, 33.4, 36.2])
data = boston.data
type(data)
numpy.ndarray
data.shape
(506, 13)
DataFrame()
A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
data = pd.DataFrame(data=data, columns=boston.feature_names)
data.head()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 |
data['Price'] = boston.target
data.head()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 396.90 | 4.98 | 24.0 |
1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.90 | 9.14 | 21.6 |
2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.90 | 5.33 | 36.2 |
Understand your data
data.describe()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 0.320000 | 1.730000 | 5.000000 |
25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 375.377500 | 6.950000 | 17.025000 |
50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 391.440000 | 11.360000 | 21.200000 |
75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 396.225000 | 16.955000 | 25.000000 |
max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 396.900000 | 37.970000 | 50.000000 |
data.info()
Pandas dataframe.info() function is used to get a concise summary of the dataframe. It comes really handy when doing exploratory analysis of the data. To get a quick overview of the dataset we use the dataframe.info() function.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   CRIM     506 non-null    float64
 1   ZN       506 non-null    float64
 2   INDUS    506 non-null    float64
 3   CHAS     506 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      506 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    float64
 9   TAX      506 non-null    float64
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    506 non-null    float64
 13  Price    506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB
data.isnull().sum()
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
Price      0
dtype: int64
Data Visualization
We will start by creating a scatterplot matrix that allows us to visualize the pair-wise relationships and correlations between the different features. It is also quite useful for getting a quick overview of how the data is distributed and whether or not it contains outliers.
sns.pairplot(data)
plt.show()
rows = 2
cols = 7
fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(16, 4))
col = data.columns
index = 0
for i in range(rows):
    for j in range(cols):
        sns.distplot(data[col[index]], ax=ax[i][j])
        index = index + 1
plt.tight_layout()
plt.show()
We are now going to create a correlation matrix to quantify and summarize the relationships between the variables. The correlation matrix is closely related to the covariance matrix; in fact, it is a rescaled version of the covariance matrix, computed from standardized features. It is a square matrix (with the same number of columns and rows) that contains Pearson's r correlation coefficients.
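As a quick sketch of that relationship (assuming the `data` DataFrame built above), the Pearson correlation matrix equals the covariance matrix of the standardized, zero-mean and unit-variance, features:

```python
import numpy as np

# Standardize each column, then take the covariance matrix
standardized = (data - data.mean()) / data.std()
cov_of_standardized = standardized.cov()

# It matches data.corr() up to floating-point error
print(np.allclose(cov_of_standardized.values, data.corr().values))  # True
```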
corrmat = data.corr()
corrmat
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CRIM | 1.000000 | -0.200469 | 0.406583 | -0.055892 | 0.420972 | -0.219247 | 0.352734 | -0.379670 | 0.625505 | 0.582764 | 0.289946 | -0.385064 | 0.455621 | -0.388305 |
ZN | -0.200469 | 1.000000 | -0.533828 | -0.042697 | -0.516604 | 0.311991 | -0.569537 | 0.664408 | -0.311948 | -0.314563 | -0.391679 | 0.175520 | -0.412995 | 0.360445 |
INDUS | 0.406583 | -0.533828 | 1.000000 | 0.062938 | 0.763651 | -0.391676 | 0.644779 | -0.708027 | 0.595129 | 0.720760 | 0.383248 | -0.356977 | 0.603800 | -0.483725 |
CHAS | -0.055892 | -0.042697 | 0.062938 | 1.000000 | 0.091203 | 0.091251 | 0.086518 | -0.099176 | -0.007368 | -0.035587 | -0.121515 | 0.048788 | -0.053929 | 0.175260 |
NOX | 0.420972 | -0.516604 | 0.763651 | 0.091203 | 1.000000 | -0.302188 | 0.731470 | -0.769230 | 0.611441 | 0.668023 | 0.188933 | -0.380051 | 0.590879 | -0.427321 |
RM | -0.219247 | 0.311991 | -0.391676 | 0.091251 | -0.302188 | 1.000000 | -0.240265 | 0.205246 | -0.209847 | -0.292048 | -0.355501 | 0.128069 | -0.613808 | 0.695360 |
AGE | 0.352734 | -0.569537 | 0.644779 | 0.086518 | 0.731470 | -0.240265 | 1.000000 | -0.747881 | 0.456022 | 0.506456 | 0.261515 | -0.273534 | 0.602339 | -0.376955 |
DIS | -0.379670 | 0.664408 | -0.708027 | -0.099176 | -0.769230 | 0.205246 | -0.747881 | 1.000000 | -0.494588 | -0.534432 | -0.232471 | 0.291512 | -0.496996 | 0.249929 |
RAD | 0.625505 | -0.311948 | 0.595129 | -0.007368 | 0.611441 | -0.209847 | 0.456022 | -0.494588 | 1.000000 | 0.910228 | 0.464741 | -0.444413 | 0.488676 | -0.381626 |
TAX | 0.582764 | -0.314563 | 0.720760 | -0.035587 | 0.668023 | -0.292048 | 0.506456 | -0.534432 | 0.910228 | 1.000000 | 0.460853 | -0.441808 | 0.543993 | -0.468536 |
PTRATIO | 0.289946 | -0.391679 | 0.383248 | -0.121515 | 0.188933 | -0.355501 | 0.261515 | -0.232471 | 0.464741 | 0.460853 | 1.000000 | -0.177383 | 0.374044 | -0.507787 |
B | -0.385064 | 0.175520 | -0.356977 | 0.048788 | -0.380051 | 0.128069 | -0.273534 | 0.291512 | -0.444413 | -0.441808 | -0.177383 | 1.000000 | -0.366087 | 0.333461 |
LSTAT | 0.455621 | -0.412995 | 0.603800 | -0.053929 | 0.590879 | -0.613808 | 0.602339 | -0.496996 | 0.488676 | 0.543993 | 0.374044 | -0.366087 | 1.000000 | -0.737663 |
Price | -0.388305 | 0.360445 | -0.483725 | 0.175260 | -0.427321 | 0.695360 | -0.376955 | 0.249929 | -0.381626 | -0.468536 | -0.507787 | 0.333461 | -0.737663 | 1.000000 |
Heatmap ( )
A heatmap
is a two-dimensional
graphical representation of data where the individual values that are contained in a matrix
are represented as colors
. The seaborn
python package allows the creation of annotated heatmaps
which can be tweaked using Matplotlib
tools as per the creator's requirement.
Now look at the following script:
fig, ax = plt.subplots(figsize=(18, 10))
sns.heatmap(corrmat, annot=True, annot_kws={'size': 12})
plt.show()
corrmat.index.values
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'Price'], dtype=object)
def getCorrelatedFeature(corrdata, threshold):
    feature = []
    value = []
    for i, index in enumerate(corrdata.index):
        if abs(corrdata[index]) > threshold:
            feature.append(index)
            value.append(corrdata[index])
    df = pd.DataFrame(data=value, index=feature, columns=['Corr Value'])
    return df
threshold = 0.50
corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
corr_value
| | Corr Value |
|---|---|
RM | 0.695360 |
PTRATIO | -0.507787 |
LSTAT | -0.737663 |
Price | 1.000000 |
corr_value.index.values
array(['RM', 'PTRATIO', 'LSTAT', 'Price'], dtype=object)
correlated_data = data[corr_value.index]
correlated_data.head()
| | RM | PTRATIO | LSTAT | Price |
|---|---|---|---|---|
0 | 6.575 | 15.3 | 4.98 | 24.0 |
1 | 6.421 | 17.8 | 9.14 | 21.6 |
2 | 7.185 | 17.8 | 4.03 | 34.7 |
3 | 6.998 | 18.7 | 2.94 | 33.4 |
4 | 7.147 | 18.7 | 5.33 | 36.2 |
Pairplot and Corrmat of correlated data
A pairplot plots pairwise relationships in a dataset. Let's look at the pair plot of the correlated data.
sns.pairplot(correlated_data)
plt.tight_layout()
sns.heatmap(correlated_data.corr(), annot=True, annot_kws={'size': 12}, linewidths=0)
plt.show()
Shuffle and Split Data
We will take the Boston housing dataset and split the data into training and testing subsets. Typically, the data is also shuffled into a random order when creating the training and testing subsets, to remove any bias in the ordering of the dataset.
Let's look at the following script:
X = correlated_data.drop(labels=['Price'], axis=1)
y = correlated_data['Price']
X.head()
| | RM | PTRATIO | LSTAT |
|---|---|---|---|
0 | 6.575 | 15.3 | 4.98 |
1 | 6.421 | 17.8 | 9.14 |
2 | 7.185 | 17.8 | 4.03 |
3 | 6.998 | 18.7 | 2.94 |
4 | 7.147 | 18.7 | 5.33 |
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train.shape, X_test.shape
((404, 3), (102, 3))
Let's train the model
model = LinearRegression()
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
df = pd.DataFrame(data=[y_predict, y_test])
df.T
| | 0 (predicted) | 1 (actual) |
|---|---|---|
0 | 27.609031 | 22.6 |
1 | 22.099034 | 50.0 |
2 | 26.529255 | 23.0 |
3 | 12.507986 | 8.3 |
4 | 22.254879 | 21.2 |
... | ... | ... |
97 | 28.271228 | 24.7 |
98 | 18.467419 | 14.1 |
99 | 18.558070 | 18.7 |
100 | 24.681964 | 28.1 |
101 | 20.826879 | 19.8 |
102 rows × 2 columns
Defining performance metrics
It is difficult to measure the quality
of a given model without quantifying
its performance over training
and testing
. This is typically done using some type of performance metric, whether it is through calculating some type of error
, the goodness of fit, or some other useful measurement. For this project, you will be calculating the coefficient of determination
, R2
, to quantify your model's performance. The coefficient of determination
for a model is a useful statistic in regression analysis
, as it often describes how "good" that model is at making predictions.
The values for R² range from 0 to 1, which captures the percentage of squared correlation between the predicted and actual values of the target variable. A model with an R² of 0 always fails to predict the target variable, whereas a model with an R² of 1 perfectly predicts the target variable. Any value between 0 and 1 indicates what percentage of the target variable's variation can be explained by the features using this model. A model can also be given a negative R², which indicates that the model is no better than one that naively predicts the mean of the target variable.
For the performance_metric function
in the code cell below, you will need to implement the following:
Use r2_score from sklearn.metrics
to perform a performance calculation between y_true
and y_predict
. Assign the performance score to the score variable.
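A minimal sketch of such a function (the name `performance_metric` and its signature follow the description above; only the `r2_score` call is essential):

```python
from sklearn.metrics import r2_score

def performance_metric(y_true, y_predict):
    """Calculate and return the R^2 score between true and predicted values."""
    score = r2_score(y_true, y_predict)
    return score
```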
Now we will find $R^2$, which is defined as follows:

$$SS_{t} = \sum_{i=1}^n (y_i - \bar{y})^2$$

$$SS_{r} = \sum_{i=1}^n (y_i - \hat{y}_i)^2$$

$$R^{2} = 1 - \frac{SS_r}{SS_t}$$

SSt = total sum of squares
SSr = sum of squares of residuals
R² ranges from 0 to 1, and can also be negative if the model is completely wrong.
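To connect the formula with the library call, here is a sketch that computes R² from the sums of squares and checks it against `sklearn.metrics.r2_score` (assuming `y_test` and `y_predict` from the model trained above):

```python
import numpy as np
from sklearn.metrics import r2_score

ss_res = np.sum((y_test - y_predict) ** 2)      # sum of squares of residuals
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_test, y_predict))   # the two values agree
```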
Regression Evaluation Metrics
Here are three common evaluation metrics for regression problems:
Mean Absolute Error (MAE) is the mean of the absolute value of the errors
: $$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$
Mean Squared Error (MSE) is the mean of the squared errors
: $${\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors
: $$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
Comparing these metrics:
- MAE is the easiest to understand, because it's the average error.
- MSE is more popular than MAE, because MSE "punishes" larger errors, which tends to be useful in the real world.
- RMSE is even more popular than MSE, because RMSE is interpretable in the "y" units.
All of these are loss functions, because we want to minimize them.
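As a sketch of how these are computed in practice (assuming `y_test` and `y_predict` from above; RMSE is simply the square root of the MSE and is not computed again later in this notebook):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(y_test, y_predict)
mse = mean_squared_error(y_test, y_predict)
rmse = np.sqrt(mse)  # RMSE: square root of the mean squared error

print('MAE :', mae)
print('MSE :', mse)
print('RMSE:', rmse)
```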
from sklearn.metrics import r2_score
correlated_data.columns
Index(['RM', 'PTRATIO', 'LSTAT', 'Price'], dtype='object')
score = r2_score(y_test, y_predict)
mae = mean_absolute_error(y_test, y_predict)
mse = mean_squared_error(y_test, y_predict)
print('r2_score: ', score)
print('mae: ', mae)
print('mse: ', mse)
r2_score: 0.48816420156925056 mae: 4.404434993909258 mse: 41.67799012221684
Store feature performance
total_features = []
total_features_name = []
selected_correlation_value = []
r2_scores = []
mae_value = []
mse_value = []
def performance_metrics(features, th, y_true, y_pred):
    score = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)

    total_features.append(len(features) - 1)
    total_features_name.append(str(features))
    selected_correlation_value.append(th)
    r2_scores.append(score)
    mae_value.append(mae)
    mse_value.append(mse)

    metrics_dataframe = pd.DataFrame(
        data=[total_features_name, total_features, selected_correlation_value,
              r2_scores, mae_value, mse_value],
        index=['features name', '#feature', 'corr_value', 'r2_score', 'MAE', 'MSE'])
    return metrics_dataframe.T
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)
| | features name | #feature | corr_value | r2_score | MAE | MSE |
|---|---|---|---|---|---|---|
0 | ['RM' 'PTRATIO' 'LSTAT' 'Price'] | 3 | 0.5 | 0.488164 | 4.40443 | 41.678 |
Regression plot of the features correlated with the House Price
Let's plot the features that are correlated with the house price:
rows = 2
cols = 2
fig, ax = plt.subplots(nrows=rows, ncols=cols, figsize=(16, 4))

ax[0, 0].set_title("House price with respect to RM")
ax[0, 1].set_title("House price with respect to PTRATIO")
ax[1, 0].set_title("House price with respect to LSTAT")
ax[1, 1].set_title("House price with respect to PRICE")

col = correlated_data.columns
index = 0
for i in range(rows):
    for j in range(cols):
        sns.regplot(x=correlated_data[col[index]], y=correlated_data['Price'], ax=ax[i][j])
        index = index + 1
fig.tight_layout()
Let's find another combination of columns to try to get better accuracy, raising the correlation threshold to 0.60
corrmat['Price']
CRIM -0.388305 ZN 0.360445 INDUS -0.483725 CHAS 0.175260 NOX -0.427321 RM 0.695360 AGE -0.376955 DIS 0.249929 RAD -0.381626 TAX -0.468536 PTRATIO -0.507787 B 0.333461 LSTAT -0.737663 Price 1.000000 Name: Price, dtype: float64
threshold = 0.60
corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
corr_value
| | Corr Value |
|---|---|
RM | 0.695360 |
LSTAT | -0.737663 |
Price | 1.000000 |
correlated_data = data[corr_value.index]
correlated_data.head()
| | RM | LSTAT | Price |
|---|---|---|---|
0 | 6.575 | 4.98 | 24.0 |
1 | 6.421 | 9.14 | 21.6 |
2 | 7.185 | 4.03 | 34.7 |
3 | 6.998 | 2.94 | 33.4 |
4 | 7.147 | 5.33 | 36.2 |
Prediction of y from the correlated data. The following function returns the predicted values for y.
def get_y_predict(corr_data):
    X = corr_data.drop(labels=['Price'], axis=1)
    y = corr_data['Price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_predict = model.predict(X_test)
    return y_predict
y_predict = get_y_predict(correlated_data)
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)
| | features name | #feature | corr_value | r2_score | MAE | MSE |
|---|---|---|---|---|---|---|
0 | ['RM' 'PTRATIO' 'LSTAT' 'Price'] | 3 | 0.5 | 0.488164 | 4.40443 | 41.678 |
1 | ['RM' 'LSTAT' 'Price'] | 2 | 0.6 | 0.540908 | 4.14244 | 37.3831 |
Let's find another combination of columns to get better accuracy, raising the correlation threshold to 0.70
corrmat['Price']
CRIM -0.388305 ZN 0.360445 INDUS -0.483725 CHAS 0.175260 NOX -0.427321 RM 0.695360 AGE -0.376955 DIS 0.249929 RAD -0.381626 TAX -0.468536 PTRATIO -0.507787 B 0.333461 LSTAT -0.737663 Price 1.000000 Name: Price, dtype: float64
threshold = 0.70
corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
corr_value
| | Corr Value |
|---|---|
LSTAT | -0.737663 |
Price | 1.000000 |
correlated_data = data[corr_value.index]
correlated_data.head()
| | LSTAT | Price |
|---|---|---|
0 | 4.98 | 24.0 |
1 | 9.14 | 21.6 |
2 | 4.03 | 34.7 |
3 | 2.94 | 33.4 |
4 | 5.33 | 36.2 |
y_predict = get_y_predict(correlated_data)
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)
| | features name | #feature | corr_value | r2_score | MAE | MSE |
|---|---|---|---|---|---|---|
0 | ['RM' 'PTRATIO' 'LSTAT' 'Price'] | 3 | 0.5 | 0.488164 | 4.40443 | 41.678 |
1 | ['RM' 'LSTAT' 'Price'] | 2 | 0.6 | 0.540908 | 4.14244 | 37.3831 |
2 | ['LSTAT' 'Price'] | 1 | 0.7 | 0.430957 | 4.86401 | 46.3363 |
Let's go ahead and select only the RM feature
correlated_data = data[['RM', 'Price']]
correlated_data.head()
| | RM | Price |
|---|---|---|
0 | 6.575 | 24.0 |
1 | 6.421 | 21.6 |
2 | 7.185 | 34.7 |
3 | 6.998 | 33.4 |
4 | 7.147 | 36.2 |
y_predict = get_y_predict(correlated_data)
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)
| | features name | #feature | corr_value | r2_score | MAE | MSE |
|---|---|---|---|---|---|---|
0 | ['RM' 'PTRATIO' 'LSTAT' 'Price'] | 3 | 0.5 | 0.488164 | 4.40443 | 41.678 |
1 | ['RM' 'LSTAT' 'Price'] | 2 | 0.6 | 0.540908 | 4.14244 | 37.3831 |
2 | ['LSTAT' 'Price'] | 1 | 0.7 | 0.430957 | 4.86401 | 46.3363 |
3 | ['RM' 'Price'] | 1 | 0.7 | 0.423944 | 4.32474 | 46.9074 |
Let's find another combination of columns, lowering the correlation threshold to 0.40
threshold = 0.40
corr_value = getCorrelatedFeature(corrmat['Price'], threshold)
corr_value
| | Corr Value |
|---|---|
INDUS | -0.483725 |
NOX | -0.427321 |
RM | 0.695360 |
TAX | -0.468536 |
PTRATIO | -0.507787 |
LSTAT | -0.737663 |
Price | 1.000000 |
correlated_data = data[corr_value.index]
correlated_data.head()
| | INDUS | NOX | RM | TAX | PTRATIO | LSTAT | Price |
|---|---|---|---|---|---|---|---|
0 | 2.31 | 0.538 | 6.575 | 296.0 | 15.3 | 4.98 | 24.0 |
1 | 7.07 | 0.469 | 6.421 | 242.0 | 17.8 | 9.14 | 21.6 |
2 | 7.07 | 0.469 | 7.185 | 242.0 | 17.8 | 4.03 | 34.7 |
3 | 2.18 | 0.458 | 6.998 | 222.0 | 18.7 | 2.94 | 33.4 |
4 | 2.18 | 0.458 | 7.147 | 222.0 | 18.7 | 5.33 | 36.2 |
y_predict = get_y_predict(correlated_data)
performance_metrics(correlated_data.columns.values, threshold, y_test, y_predict)
| | features name | #feature | corr_value | r2_score | MAE | MSE |
|---|---|---|---|---|---|---|
0 | ['RM' 'PTRATIO' 'LSTAT' 'Price'] | 3 | 0.5 | 0.488164 | 4.40443 | 41.678 |
1 | ['RM' 'LSTAT' 'Price'] | 2 | 0.6 | 0.540908 | 4.14244 | 37.3831 |
2 | ['LSTAT' 'Price'] | 1 | 0.7 | 0.430957 | 4.86401 | 46.3363 |
3 | ['RM' 'Price'] | 1 | 0.7 | 0.423944 | 4.32474 | 46.9074 |
4 | ['INDUS' 'NOX' 'RM' 'TAX' 'PTRATIO' 'LSTAT' 'P... | 6 | 0.4 | 0.476203 | 4.3945 | 42.6519 |
Now let's go ahead and understand what Normalization and Standardization are
Standardization
Standardization
of data sets is a common requirement for many machine learning estimators
implemented in scikit-learn
; they might behave badly
if the individual features do not more or less look like standard normally distributed data
: Gaussian
with zero mean
and unit variance
.
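As a sketch, scikit-learn's `StandardScaler` performs exactly this transformation; it should be fit on the training data only and then applied to both splits (the variable names below assume the train/test split created earlier):

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std on the training data
X_test_std = scaler.transform(X_test)        # reuse the same mean/std on the test data

print(X_train_std.mean(axis=0).round(2))  # ~0 for every feature
print(X_train_std.std(axis=0).round(2))   # ~1 for every feature
```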
Normalization
Normalization
is the process of scaling individual samples to have unit norm
. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel
to quantify the similarity of any pair of samples.
This assumption is the base of the Vector Space Model
often used in text classification
and clustering contexts
.
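A sketch of sample-wise normalization with `Normalizer` (again assuming `X_train` from the earlier split): each row, i.e. each sample, is rescaled so that its vector norm is 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X_norm = Normalizer(norm='l2').fit_transform(X_train)

# Every sample (row) now has unit Euclidean length
print(np.linalg.norm(X_norm, axis=1)[:5])  # ~[1. 1. 1. 1. 1.]
```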
Name | Sklearn class |
---|---|
Standard scaler | StandardScaler |
MinMax scaler | MinMaxScaler |
MaxAbs scaler | MaxAbsScaler |
Robust scaler | RobustScaler |
Quantile Transformer (normal) | QuantileTransformer(output_distribution='normal') |
Quantile Transformer (uniform) | QuantileTransformer(output_distribution='uniform') |
Power Transformer (Yeo-Johnson) | PowerTransformer(method='yeo-johnson') |
Normalizer | Normalizer |
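Note that the `normalize=True` argument used in the next cell was deprecated and later removed from `LinearRegression` in recent scikit-learn releases. A sketch of the commonly recommended modern pattern (not an exact drop-in for the old behaviour) is a `Pipeline` that standardizes the features before fitting:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Scale the features, then fit ordinary least squares
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)
print(r2_score(y_test, pipe.predict(X_test)))
```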
model = LinearRegression(normalize=True)  # note: the normalize= parameter was removed in scikit-learn 1.2
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
r2_score(y_test, y_predict)
0.48816420156925067
Plotting Learning Curves
Now we will try to plot the Learning curves:
from sklearn.model_selection import learning_curve, ShuffleSplit
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 10)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)  # apply the requested y-axis limits
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    return plt

X = correlated_data.drop(labels=['Price'], axis=1)
y = correlated_data['Price']

title = "Learning Curves (Linear Regression) " + str(X.columns.values)
cv = ShuffleSplit(n_splits=100, test_size=0.2, random_state=0)
estimator = LinearRegression()
plot_learning_curve(estimator, title, X, y, ylim=(0.7, 1.01), cv=cv, n_jobs=-1)
plt.show()