Improve Training Time of Machine Learning Model Using Bagging | KGP Talkie
How bagging works
First of all, let's try to understand how bagging works.
Let's say we have a dataset to train the model. First we divide this dataset into a number of smaller datasets (at least two). We then fit a classifier on each dataset separately, and finally aggregate the individual predictions to get the output.
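The steps above can be sketched by hand. This is a minimal illustration, not the article's code: it uses scikit-learn's SVC on the iris data, with a 3-way split and a simple majority vote chosen just for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

# Load a small dataset and shuffle it so each subset sees all classes.
X, y = datasets.load_iris(return_X_y=True)
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
X, y = X[idx], y[idx]

# Step 1: divide the dataset into 3 smaller datasets.
X_parts = np.array_split(X, 3)
y_parts = np.array_split(y, 3)

# Step 2: train a separate classifier on each subset.
models = [SVC(kernel='linear').fit(Xp, yp) for Xp, yp in zip(X_parts, y_parts)]

# Step 3: aggregate -- majority vote over the individual predictions.
preds = np.stack([m.predict(X) for m in models])
final = np.array([np.bincount(col).argmax() for col in preds.T])

print('aggregate accuracy:', (final == y).mean())
```

This is exactly what `BaggingClassifier` automates for us later in the post (with random resampling instead of a fixed split).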
SVM training time complexity is roughly O(n³), i.e. as we increase the number of input samples, training time grows cubically. If 1000 input samples take 10 seconds to train, then 3000 input samples might take 10 × 3³ = 270 seconds to train.
If instead we divide the 3000 samples into 3 subsets, each subset contains 1000 samples and takes 10 seconds to train, so the overall training time is 10 + 10 + 10 = 30 seconds. So instead of 270 seconds on the full dataset, splitting into 3 sets takes only 30 seconds. In this way we can improve the training time of a machine learning model.
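The arithmetic above can be checked in a couple of lines (the 10-second figure is the hypothetical number from the text, not a measurement):

```python
# Cubic scaling: tripling the input size multiplies SVM training time by 3**3.
base_time = 10                              # seconds for 1000 samples (figure from the text)
full_time = base_time * (3000 / 1000) ** 3  # all 3000 samples at once
print(full_time)                            # 270.0 seconds

# Bagging: three subsets of 1000 samples, each trained independently.
bagged_time = 3 * base_time
print(bagged_time)                          # 30 seconds
```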
Let’s have a look into the following script:
Importing required libraries
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn import datasets
from sklearn.svm import SVC
Importing iris dataset
iris = datasets.load_iris()
iris.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
Let's see the description of the iris dataset:
print(iris['DESCR'])
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%[email protected])
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken from Fisher's paper. Note that it's the same as in R, but not as in the UCI Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
X = iris.data
y = iris.target
Let's go ahead and get the shape of X and y:
X.shape, y.shape
((150, 4), (150,))
So X has 150 samples and 4 attributes, and y has 150 labels. We know 150 samples is a small number for machine learning, so to get more samples we will use NumPy's repeat() function.
We will repeat these 150 samples 500 times, giving us 75,000 samples:
X = np.repeat(X, repeats=500, axis=0)
y = np.repeat(y, repeats=500, axis=0)
((75000, 4), (75000,))
Now, to train the model without bagging, we create an SVC classifier and fit it on the full dataset:
%%time
clf = SVC(kernel='linear', probability=True, class_weight='balanced')
clf.fit(X, y)
print('SVC: ', clf.score(X, y))
SVC:  0.98
Wall time: 34.5 s
Now, to train the model with bagging, we wrap the same SVC inside a BaggingClassifier. With max_samples=1.0/n_estimators, each of the 10 estimators is trained on roughly one tenth of the data:
%%time
n_estimators = 10
clf = BaggingClassifier(SVC(kernel='linear', probability=True, class_weight='balanced'),
                        n_estimators=n_estimators,
                        max_samples=1.0/n_estimators)
clf.fit(X, y)
print('SVC: ', clf.score(X, y))
SVC: 0.98 Wall time: 10.5 s
So from the above results we can observe an improvement in the training time of the model, from 34.5 seconds without bagging to 10.5 seconds with bagging, while the accuracy stays at 0.98.
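Note that %%time is a Jupyter magic and is not available in a plain script. A minimal sketch of the same comparison using time.perf_counter follows; it uses a smaller repeat factor than the article so it runs quickly, so the absolute numbers will differ from those above.

```python
import time
import numpy as np
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)
X = np.repeat(X, repeats=50, axis=0)   # smaller repeat factor than in the article
y = np.repeat(y, repeats=50, axis=0)

def timed_fit(clf):
    """Fit the classifier and return the elapsed wall time in seconds."""
    start = time.perf_counter()
    clf.fit(X, y)
    return time.perf_counter() - start

plain = timed_fit(SVC(kernel='linear'))

n_estimators = 10
bagged = timed_fit(BaggingClassifier(
    SVC(kernel='linear'),
    n_estimators=n_estimators,
    max_samples=1.0 / n_estimators))

print(f'plain SVC:  {plain:.3f} s')
print(f'bagged SVC: {bagged:.3f} s')
```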