# Decision Tree Machine Learning in Python KGP Talkie

For detailed theory, read An Introduction to Statistical Learning.

A `decision tree` is a flowchart-like `tree structure` in which each `internal node` represents a feature, each branch represents a `decision rule`, and each `leaf node` represents the `outcome`. The `topmost node` in a decision tree is known as the `root node`. The tree learns to partition the data on the basis of the `attribute value`. It partitions the data `recursively`, a process called `recursive partitioning`. This flowchart-like structure helps you in `decision making`. Its visualization, like a `flowchart diagram`, easily mimics `human-level` thinking. That is why `decision trees` are easy to understand and interpret.

### Example
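As a rough illustration, a decision tree can be read as nested `if`/`else` tests. The sketch below is hand-written, not learned from data. The attributes `outlook` and `humidity`, the threshold 70 and the outcomes are hypothetical values chosen only to show the roles of the root node, internal nodes and leaf nodes.

```python
def decide(outlook, humidity):
    """Hand-written tree: each `if` is an internal node, each `return` is a leaf."""
    # Root node: test the (hypothetical) 'outlook' attribute.
    if outlook == "sunny":
        # Internal node: test the (hypothetical) 'humidity' attribute.
        if humidity > 70:
            return "stay in"  # leaf node
        return "play"         # leaf node
    return "play"             # leaf node for every other outlook


print(decide("sunny", 85))
print(decide("rainy", 85))
```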

## Why Decision Tree

- Decision trees often mimic `human-level thinking`, so it is simple to understand the data and make good interpretations.
- Decision trees let you see the `logic` used to interpret the data (unlike black-box algorithms such as SVM, NN, etc.).

## How Decision Tree Works

- Select the best attribute using `Attribute Selection Measures (ASM)` to split the records.
- Make that attribute a decision node and break the dataset into smaller subsets.
- Build the tree by repeating this process `recursively` for each child until one of the following conditions is met:
  - All the `tuples` belong to the same attribute value.
  - There are no more remaining `attributes`.
  - There are no more `instances`.

There are several algorithms for building a decision tree. We only talk about two of them here:

- CART (Classification and Regression Trees) → uses the Gini Index (classification) as its metric.
- ID3 (Iterative Dichotomiser 3) → uses the entropy function and information gain as metrics.

## Decision Making in DT with Attribute Selection Measures(ASM)

- Information Gain
- Gain Ratio
- Gini Index

Read Chapter 8: http://faculty.marshall.usc.edu/gareth-james/ISL/

### Information Gain

In order to define information gain precisely, we begin by defining a measure commonly used in `information theory`, called `entropy`, that characterizes the (im)purity of an arbitrary collection of examples.

### Entropy

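Entropy is the measure of the amount of `uncertainty` in the data set.

$$ H(S) = \sum_{c \in C} -p(c) \cdot \log_2 p(c) $$

where

- S = the current data set for which entropy is being calculated
- C = the set of classes in S, for example C = {yes, no}
- p(c) = the proportion of the elements of S that belong to class c

When H(S) = 0, the set S is perfectly classified, i.e. all elements of S belong to the same class.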
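A minimal sketch of the entropy formula in plain Python, assuming the data set is given as a flat list of class labels. The function name `entropy` and the yes/no labels are chosen for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy H(S), in bits, of a list of class labels."""
    total = len(labels)
    # p(c) = n / total is the proportion of labels that belong to class c.
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


print(entropy(["yes", "yes", "no", "no"]))    # evenly mixed set -> 1.0
print(entropy(["yes", "yes", "yes", "yes"]))  # pure set -> 0.0
```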

### Information gain

Information gain calculates the `reduction` in `entropy` or `surprise` from transforming a dataset in some way. It is the measure of the `difference` in entropy from before to after the set `S` is split on an attribute A. It is commonly used in the construction of `decision trees` from a training dataset: we evaluate the information gain for each variable and select the variable that maximizes the `information gain`, which in turn minimizes the `entropy` and best splits the dataset into groups for effective classification.

$$ IG(A,S) = H(S) - \sum_{t \in T} p(t) \cdot H(t) $$

where

- H(S) = entropy of set S
- T = the subsets created from splitting set S by attribute A
- p(t) = the proportion of the elements of S that fall in subset t
- H(t) = entropy of subset t
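The formula can be sketched directly in Python for categorical attributes. The rows, the attribute names `outlook` and `windy`, and the labels below are hypothetical toy data, and the helper names are chosen for illustration.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Entropy H(S), in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """IG(A, S) = H(S) - sum over subsets t of p(t) * H(t)."""
    subsets = defaultdict(list)
    for row, label in zip(rows, labels):
        subsets[row[attribute]].append(label)  # split S by the value of A
    weighted = sum(len(t) / len(labels) * entropy(t) for t in subsets.values())
    return entropy(labels) - weighted


# Hypothetical toy data: 'outlook' separates the classes, 'windy' does not.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]

gains = {a: information_gain(rows, labels, a) for a in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(gains, best)
```

On this toy data the split on `outlook` removes all uncertainty, so it would be selected.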

- Compute the `entropy` for the data set.
- For every feature:
  1. Calculate the `entropy` for all categorical values.
  2. Take the average information `entropy` for the current attribute.
  3. Calculate the gain for the current attribute.
  4. Pick the highest gain attribute.
- Repeat until we get the tree we desired.
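The steps above can be sketched as a small recursive function in the style of ID3. This is an illustration for categorical attributes only, not how scikit-learn builds its trees. The toy rows, the nested-dict tree format and the function names are assumptions made for the example, and the recursion stops when all labels in a subset are identical or when no attributes remain.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def build_tree(rows, labels, attributes):
    """Return a leaf label or a nested dict {'attribute': ..., 'children': {...}}."""
    # Stop: every remaining example has the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no attributes left, fall back to the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    def gain(attribute):
        groups = defaultdict(list)
        for row, label in zip(rows, labels):
            groups[row[attribute]].append(label)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    # Pick the attribute with the highest information gain.
    best = max(attributes, key=gain)
    remaining = [a for a in attributes if a != best]

    # Partition the data by the chosen attribute and recurse into each subset.
    partitions = defaultdict(lambda: ([], []))
    for row, label in zip(rows, labels):
        partitions[row[best]][0].append(row)
        partitions[row[best]][1].append(label)

    children = {
        value: build_tree(sub_rows, sub_labels, remaining)
        for value, (sub_rows, sub_labels) in partitions.items()
    }
    return {"attribute": best, "children": children}


# Hypothetical toy data.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

Because subsets are created only for attribute values that actually occur, the "no more instances" case never arises in this sketch.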

### Gain Ratio

Gain ratio tries to `correct` the information gain’s bias towards attributes with many possible values by adding a denominator to information gain called `split information`. Split information tries to measure how broadly and uniformly the attribute `splits` the data:

$$ \text{SplitInformation}(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \cdot \log_2 \frac{|S_i|}{|S|} $$

where S_1 through S_c are the subsets that result from partitioning S by the c values of attribute A.

The gain ratio is defined in terms of `Gain` (the information gain defined above) and `SplitInformation` as

$$ \text{GainRatio}(S, A) \equiv \frac{\text{Gain}(S, A)}{\text{SplitInformation}(S, A)} $$
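A minimal sketch of the gain ratio, assuming one categorical attribute given as a list of values aligned with the labels. The data is hypothetical: `ids` stands for an attribute with a unique value per row, and `binary` for an attribute with two values.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Entropy, in bits, of a list of discrete values."""
    total = len(values)
    return sum(-(n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(attribute_values, labels):
    """Gain(S, A) / SplitInformation(S, A) for one categorical attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())
    # SplitInformation is the entropy of the attribute values themselves.
    split_information = entropy(attribute_values)
    # Assumption: report 0.0 when the attribute has a single value (no split).
    return gain / split_information if split_information > 0 else 0.0


labels = ["yes", "yes", "no", "no"]
ids = [1, 2, 3, 4]             # many-valued attribute
binary = ["a", "a", "b", "b"]  # two-valued attribute

print(gain_ratio(ids, labels))
print(gain_ratio(binary, labels))
```

Both attributes have the same information gain on this toy data, but the many-valued one has a larger split information and therefore a lower gain ratio.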

### Gini Index

`Gini Index` is a measurement of the likelihood of an incorrect classification of a new instance of a `random variable`, if that new instance were randomly classified according to the distribution of class labels from the data set.

If our dataset is pure, then the likelihood of incorrect classification is `0`. If our sample is a mixture of different classes, then the likelihood of incorrect classification will be `high`.
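The usual formula for the Gini index of a set is one minus the sum of the squared class proportions. A minimal sketch, assuming the set is given as a list of class labels (the function name `gini` is chosen for illustration):

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


print(gini(["a", "a", "a", "a"]))  # pure set -> 0.0
print(gini(["a", "a", "b", "b"]))  # evenly mixed, two classes -> 0.5
```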

## Optimizing DT

**Criterion :**

Optional (default="gini"), chooses the attribute selection measure: this parameter allows us to use different attribute selection `measures`. Supported criteria are `"gini"` for the `Gini index` and `"entropy"` for the `information gain`.

**Splitter :**

String, optional (default="best"), the split strategy: this parameter allows us to choose the `split strategy`. Supported strategies are `"best"` to choose the best split and `"random"` to choose the best random split.

**Max_depth :**

Int or None, optional (default=None), the `maximum depth` of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples. A higher value of `maximum depth` tends to cause `overfitting`, and a lower value tends to cause `underfitting`.
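A minimal sketch of how these three parameters might be passed to `DecisionTreeClassifier`. The iris data, the split settings and the value `max_depth=2` are assumptions chosen for illustration, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Depth-limited tree using entropy as the selection measure.
shallow = DecisionTreeClassifier(
    criterion="entropy", splitter="best", max_depth=2, random_state=1
).fit(X_train, y_train)

# Unrestricted tree using the default Gini index.
full = DecisionTreeClassifier(
    criterion="gini", splitter="best", max_depth=None, random_state=1
).fit(X_train, y_train)

for name, model in [("max_depth=2", shallow), ("max_depth=None", full)]:
    print(
        name,
        "| depth:", model.get_depth(),
        "| train accuracy:", model.score(X_train, y_train),
        "| test accuracy:", model.score(X_test, y_test),
    )
```

Comparing train and test accuracy for different depths is one simple way to see where a tree starts to overfit.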

### Recursive Binary Splitting

In this procedure all the features are considered, and different `split points` are tried and tested using a `cost function`. The split with the `best (i.e. lowest) cost` is selected.
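A sketch of a single splitting step for one numeric feature, assuming the cost function is the size-weighted Gini index of the two sides. The candidate thresholds (midpoints between sorted unique values) and the toy data are assumptions for illustration. A full tree builder would repeat this over every feature and then recurse into each side.

```python
from collections import Counter


def gini(labels):
    """Gini index: 1 - sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, cost) with the lowest weighted Gini for one feature."""
    best = (None, float("inf"))
    unique = sorted(set(values))
    # Candidate split points: midpoints between neighbouring unique values.
    for low, high in zip(unique, unique[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical feature values and class labels.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]

print(best_split(values, labels))
```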

### When to stop splitting?

You might ask when to `stop` growing a tree. As a problem usually has a large set of features, it results in a large number of splits, which in turn gives a `huge` tree. Such trees are complex and can lead to `overfitting`. So, we need to know when to stop. The `max_depth` parameter described above is one way to set such a limit.

### Pruning

The `performance` of a tree can be further increased by `pruning`. It involves removing the branches that make use of features having `low` importance. This way, we reduce the complexity of the tree and thus increase its predictive power by reducing `overfitting`.
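scikit-learn does not prune by feature importance directly. What it offers is minimal cost-complexity pruning through the `ccp_alpha` parameter, so the sketch below swaps in that technique as one way to prune a tree. The iris data and the choice of the middle candidate alpha are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

# Fully grown tree, kept for reference.
unpruned = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Candidate alphas at which successive subtrees would be pruned away.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(
    X_train, y_train
)

# Assumption: take the middle candidate (clipped at 0 to guard against
# rounding); in practice alpha would be chosen by validation.
alpha = max(float(path.ccp_alphas[len(path.ccp_alphas) // 2]), 0.0)
pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X_train, y_train)

print("alpha:", alpha)
print("nodes before/after:", unpruned.tree_.node_count, pruned.tree_.node_count)
print(
    "test accuracy before/after:",
    unpruned.score(X_test, y_test),
    pruned.score(X_test, y_test),
)
```

Larger values of `ccp_alpha` remove more of the tree, so the value is a trade-off between simplicity and fit.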

## Decision Tree Regressor

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Jupyter magic: render plots inline in the notebook
%matplotlib inline

from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load the diabetes regression data set that ships with scikit-learn
diabetes = datasets.load_diabetes()
diabetes.keys()
```

```
dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])
```

```python
print(diabetes.DESCR)
```

```
.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, T-Cells (a type of white blood cells)
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone
      - s5      ltg, lamotrigine
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).
```

```python
diabetes.feature_names
```

```
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
```

```python
diabetes.target[:10]
```

```
array([151., 75., 141., 206., 135., 97., 138., 63., 110., 310.])
```

```python
# Feature matrix and target vector
X = diabetes.data
y = diabetes.target
X.shape, y.shape
```

```
((442, 10), (442,))
```

```python
# Put features and target in one DataFrame for inspection and plotting
df = pd.DataFrame(X, columns=diabetes.feature_names)
df['target'] = y
df.head()
```

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.050680 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.068330 | -0.092204 | 75.0 |
| 2 | 0.085299 | 0.050680 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.025930 | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | 135.0 |

### Pairplot()

By default, this function will create a `grid` of Axes such that each numeric variable in data will be shared in the y-axis across a single row and in the x-axis across a single column. The `diagonal` Axes are treated differently, drawing a plot to show the `univariate` distribution of the data for the variable in that column.

```python
sns.pairplot(df)
plt.show()
```

## Decision Tree Regressor

Let’s see decision tree as regressor:

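```python
# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit an unrestricted regression tree and predict on the test set
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
```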

The following plot shows predicted values of y and true values of y:

```python
plt.figure(figsize=(16, 4))
plt.plot(y_pred, label='y_pred')
plt.plot(y_test, label='y_test')
plt.xlabel('X_test', fontsize=14)
plt.ylabel('Value of y(pred , test)', fontsize=14)
plt.title('Comparing predicted values and true values')
plt.legend(title='Parameter where:')
plt.show()
```

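Now we will compute the root mean squared error (RMSE) of the predictions. The function `mean_squared_error()` returns the mean squared error, so we take its square root:

```python
np.sqrt(metrics.mean_squared_error(y_test, y_pred))
```

```
70.61829663921893
```

For comparison, the standard deviation of `y_test` is the RMSE we would get by always predicting the mean of `y_test`, so the unrestricted tree is only slightly better than that baseline here:

```python
y_test.std()
```

```
72.78840394263774
```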

## Decision Tree as a Classifier

Let’s see decision tree as classifier:

```python
from sklearn.tree import DecisionTreeClassifier
```

We use the iris data set:

```python
iris = datasets.load_iris()
iris.target_names
```

```
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
```

```python
iris.feature_names
```

```
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```

```python
X = iris.data
y = iris.target

# Put features and target in one DataFrame for inspection and plotting
df = pd.DataFrame(X, columns=iris.feature_names)
df['target'] = y
df.head()
```

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |

```python
sns.pairplot(df)
plt.show()
```

```python
# Stratified split keeps the class proportions equal in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2, stratify=y)

clf = DecisionTreeClassifier(criterion='gini', random_state=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('Accuracy: ', metrics.accuracy_score(y_test, y_pred))
```

```
Accuracy: 0.9666666666666667
```

Now we will evaluate the classifier using a `confusion matrix`. For this we will use the function `confusion_matrix()`. Each cell in the square box shows, as an absolute count or a relative ratio, how many samples of a given actual class (`y_test`) were assigned to a given predicted class (`y_pred`).

Now let’s see the following script:

```python
from mlxtend.evaluate import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

print('Confusion Matrix')
cm = confusion_matrix(y_test, y_pred)
fig, ax = plot_confusion_matrix(conf_mat=cm)
plt.title('Relative ratios between actual class and predicted class ')
plt.show()
```

```
Confusion Matrix
```

### Classification_report()

The `classification_report` function builds a text report showing the main classification metrics. Here we do not pass `target_names`, so the classes are reported by their inferred labels 0, 1 and 2:

```python
print(metrics.classification_report(y_test, y_pred))
```

```
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.91      1.00      0.95        10
           2       1.00      0.90      0.95        10

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30
```
