Random Forest Classifier and Regressor with Python

Random Forest is one of the most practical and widely-used machine learning algorithms available today. It powers recommendation engines, fraud detection systems, and medical diagnosis tools — anywhere a single model would be too fragile, a forest of models provides the robustness you need.

A Random Forest is an ensemble method that builds many independent decision trees, each trained on a random bootstrap sample of your data and a random subset of your features. To make a prediction, every tree votes and the majority class (for classification) or the average value (for regression) becomes the final answer. This process is called bagging (bootstrap aggregating), and it dramatically reduces the variance that plagues individual deep decision trees.

In this tutorial you will build a RandomForestRegressor on the scikit-learn Diabetes dataset to predict disease progression, and a RandomForestClassifier on the Iris dataset to classify flower species. You will also extract feature importances to understand what the model learned.

Prerequisites: Python 3.x, scikit-learn, NumPy, Pandas, Matplotlib, seaborn, mlxtend.

How the Random Forest Algorithm Works

The core idea is simple: many imperfect models, averaged together, outperform any single model.

At prediction time the algorithm follows four steps:

Draw a random bootstrap sample (with replacement) from the training dataset.
Build a decision tree on that sample, but at each split consider only a random subset of features rather than all of them.
Repeat steps 1–2 to produce n_estimators independent trees.
Aggregate: for classification, take the majority vote; for regression, take the mean prediction.

The diagram below shows four trees, each trained on a different random feature subset ( $N_{1}$ through $N_{4}$ ), combining their class predictions through majority voting to produce the final class:

Random forest diagram showing four decision trees each trained on a different feature subset, combining predictions via majority voting to produce a final class label

Feature Importance via Gini Impurity

Random Forest uses Gini impurity — also called mean decrease in impurity (MDI) — to rank features. At every node of every tree, the algorithm records how much that split reduced impurity. Summing those reductions across all trees, weighted by the number of samples that passed through each node, yields an importance score for each feature. The larger the score, the more the model relies on that feature.

Random Forest vs Decision Trees

A single deep decision tree tends to memorize the training data (overfit), but a random forest prevents this by training each tree on a different data sample and forcing each split to consider only a random feature subset. The trade-off is interpretability: you can read the rules in a single tree, but you cannot easily read 100 trees at once. Use a decision tree when you need a human-readable model; use a random forest when you need higher accuracy.

Random Forest as a Regressor

This section trains a RandomForestRegressor on the Diabetes dataset, which records ten baseline medical measurements for 442 patients and a target representing disease progression one year later.

Start by importing every library you need for the regression task:

PYTHON

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

PYTHON

from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

Load the dataset and inspect its keys:

PYTHON

diabetes = datasets.load_diabetes()
diabetes.keys()

PYTHON

dict_keys(['data', 'target', 'frame', 'DESCR', 'feature_names', 'data_filename', 'target_filename'])

Print the full dataset description to understand what each feature represents:

PYTHON

print(diabetes.DESCR)

PYTHON

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, T-Cells (a type of white blood cells)
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, thyroid stimulating hormone
      - s5      ltg, lamotrigine
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Separate the features from the target and confirm the array dimensions:

PYTHON

X = diabetes.data
y = diabetes.target
X.shape, y.shape

OUTPUT

((442, 10), (442,))

Split the data into 70 % training and 30 % test sets:

PYTHON

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

Fitting the Random Forest Regressor

A RandomForestRegressor is a meta estimator — it fits many individual decision tree regressors on random subsets of the data and averages their predictions to reduce variance. Fit the model with 100 trees and generate test-set predictions:

PYTHON

regressor = RandomForestRegressor(n_estimators=100, random_state = 42)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

The plot below overlays the predicted disease-progression values against the true test values, giving a visual sense of how closely the model tracks reality:

PYTHON

plt.figure(figsize=(16,4))
plt.plot(y_pred,  label='y_pred')
plt.plot(y_test, label='y_test')
plt.xlabel('X_test', fontsize=14)
plt.ylabel('Value of y(pred , test)', fontsize=14)
plt.title('Comparing predicted values and true values')
plt.legend(title='Parameter where:')
plt.show()

Line plot comparing the random forest regressor's predicted disease-progression values (blue) against the true test values (orange) across 133 test samples

The two lines track each other closely across most of the test range, which is a good qualitative sign before computing a numeric metric.

Evaluating Regression Error

Root Mean Squared Error (RMSE) is the square root of the average squared difference between predicted and true values — it is in the same units as the target. Calculate the RMSE using mean_squared_error():

PYTHON

np.sqrt(metrics.mean_squared_error(y_test, y_pred))

OUTPUT

53.505825893179875

To put this in context, calculate what fraction of the target's standard deviation the error represents:

PLAINTEXT

(72.78-53.50)/72.78

PLAINTEXT

0.26490794174223686

PYTHON

y_test.std()

OUTPUT

73.47317715932746

The RMSE of 53.5 is about 27 % of the target's standard deviation of 73.5, meaning the model explains a meaningful portion of the variance in disease progression — a reasonable baseline result without any hyperparameter tuning.

Random Forest as a Classifier

This section trains a RandomForestClassifier on the Iris dataset — 150 flower samples with four measurements each, split across three species.

Import the classifier and load the dataset:

PYTHON

from sklearn.ensemble import RandomForestClassifier

PYTHON

iris = datasets.load_iris()
iris.target_names

OUTPUT

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

Print the dataset description to understand the features and class distribution:

PYTHON

print(iris.DESCR)

PYTHON

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

Assign features and target, then split with stratification to preserve the class balance in both sets:

PYTHON

X = iris.data
y = iris.target

PYTHON

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1, test_size = 0.3, stratify = y)

Fit the classifier with 100 trees:

PYTHON

clf = RandomForestClassifier(n_estimators=100, random_state = 42)
clf.fit(X_train, y_train)          RandomForestClassifier(random_state=42)

Predicting and Measuring Accuracy

The predict() method assigns the majority-vote class label from all 100 trees to each test sample. Run predictions and print the accuracy score:

PYTHON

y_pred = clf.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred))

OUTPUT

0.9777777777777777

The model achieves 97.8 % accuracy on the 45-sample test set, meaning it misclassified only one flower.

Interpreting the Confusion Matrix

A confusion matrix is a square grid where each cell $(i, j)$ counts the number of samples from true class $i$ that were predicted as class $j$ . The diagonal holds correct predictions; off-diagonal cells are mistakes. Compute the raw matrix first:

PYTHON

mat = metrics.confusion_matrix(y_test, y_pred)
mat

OUTPUT

array([[15,  0,  0],
       [ 0, 15,  0],
       [ 0,  1, 14]], dtype=int64)

The classifier was perfect on setosa and versicolor, and misclassified one virginica sample as versicolor. Now plot the matrix as a colour-coded heatmap for easier reading:

PYTHON

from mlxtend.evaluate import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix
print('Confusion Matrix')
cm = confusion_matrix(y_test, y_pred)
fig, ax = plot_confusion_matrix(conf_mat = cm  )
plt.title('Relative  ratios between  actual class and predicted class ')
plt.show()

OUTPUT

Confusion Matrix

The heatmap below makes it immediately clear that the three diagonal blocks are dark (many correct predictions) and only one off-diagonal cell is non-zero:

Confusion matrix heatmap for the Random Forest Iris classifier showing 15/15 correct for setosa and versicolor, and 14/15 correct for virginica with one misclassification

Extracting Feature Importances

After training, clf.feature_importances_ returns the mean decrease in Gini impurity for each feature, normalized so that all values sum to 1. Retrieve the array and match it to the feature names:

PYTHON

clf.feature_importances_

OUTPUT

array([0.1160593 , 0.03098375, 0.43034957, 0.42260737])

PYTHON

iris.feature_names

OUTPUT

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

Petal length (43 %) and petal width (42 %) together account for 85 % of the model's discriminating power. Sepal length contributes a modest 12 %, while sepal width is almost irrelevant at 3 %. This matches the dataset's own correlation statistics, where petal measurements have correlations above 0.95.

Conclusion

In this tutorial you trained a RandomForestRegressor on the 442-sample Diabetes dataset, achieving an RMSE of 53.5 against a target standard deviation of 73.5. You then trained a RandomForestClassifier on the Iris dataset and reached 97.8 % accuracy, misclassifying just one of 45 test samples. Finally, you extracted feature importances and confirmed that petal length and petal width drive almost all of the classifier's decisions.

Key takeaways:

Random Forest reduces overfitting by training each tree on a different bootstrap sample and restricting each split to a random feature subset — neither trick alone is enough, but together they work well.
RMSE normalized against the target's standard deviation gives you a quick sense of how much variance the regressor explains without needing R² directly.
Feature importances are computed from Gini impurity reductions across all trees — they are a fast way to rank features, but can overrate high-cardinality numerical features on some datasets.
The confusion matrix is your first diagnostic after a classification run: one glance tells you which classes the model confuses and whether errors are symmetric.

Next steps:

Read Ensemble Learning in Python for a broader look at bagging, boosting, and stacking and how they compare to Random Forest.
Explore Decision Tree in Python to understand the building block that every tree in the forest is based on.
Tune your forest by experimenting with max_features (the fraction of features considered at each split) — reducing it below the default 'sqrt' often reduces correlation between trees and improves ensemble accuracy.

Random Forest Classifier and Regressor with Python

Topics You Will Master

How the Random Forest Algorithm Works

Feature Importance via Gini Impurity

Random Forest vs Decision Trees

Random Forest as a Regressor

Fitting the Random Forest Regressor

Evaluating Regression Error

Random Forest as a Classifier

Predicting and Measuring Accuracy

Interpreting the Confusion Matrix

Extracting Feature Importances

Conclusion

Latest recommendations you might like

LinkedIn Auto Connect Bot

Dimensionality Reduction with LDA and PCA in Python

Find this tutorial useful?

Discussion & Comments