
Logistic Regression with Python

Learn how logistic regression works — from the sigmoid function to the cost function — and build a Titanic survival classifier in Python using scikit-learn, recursive feature elimination, and ROC-AUC evaluation.

May 17, 2026 at 3:00 PM | 19 min read

Topics You Will Master

How logistic regression turns a linear score into a probability via the sigmoid function
The role of the decision boundary and cost function in binary classification
Building and evaluating a logistic regression classifier with scikit-learn
Using Recursive Feature Elimination (RFE) to select the most predictive features
Interpreting accuracy, log loss, and the ROC-AUC curve
Best For

Python developers and data scientists who understand basic supervised learning and want to build their first probabilistic binary classifier.

Expected Outcome

A trained logistic regression model that predicts Titanic survival with ~85 % accuracy and an AUC of ~0.90, along with feature selection and full classification metrics.

Logistic regression is one of the most widely used algorithms for binary classification. Unlike linear regression, which predicts a continuous output, logistic regression predicts the probability that an input belongs to one of two classes. That probability is then compared against a decision threshold to produce a class label — making it a fast, interpretable baseline for any classification task.

In this tutorial you will predict whether a passenger survived the Titanic disaster. You will preprocess the raw dataset, encode categorical variables, apply Recursive Feature Elimination to pick the five most informative features, and evaluate the final model with accuracy, log loss, and the Receiver Operating Characteristic (ROC) curve.

Prerequisites: Python 3.x, NumPy, Pandas, Matplotlib, Seaborn, scikit-learn.

What is Logistic Regression?

Logistic regression is a supervised classification algorithm that estimates the probability of a binary outcome. Given a set of input features $x$, the model outputs a value between 0 and 1 representing the probability that the target variable $y = 1$. A decision threshold, typically 0.5, is then applied to convert that probability into a class label.

The diagram below contrasts linear regression and logistic regression: linear regression can produce predictions outside the [0, 1] range, while logistic regression constrains its output to lie strictly between 0 and 1.

Side-by-side comparison showing that linear regression predictions can exceed the 0–1 range while logistic regression predictions are bounded between 0 and 1

Logistic regression is used everywhere you need a probabilistic yes/no answer: spam vs. not spam, fraudulent vs. legitimate transaction, malignant vs. benign tumour. The key insight is that it replaces the linear output with a nonlinear sigmoid function that squashes any real value into the (0, 1) interval.

Types of Logistic Regression

Logistic regression has three variants based on the number of target classes:

  1. Binomial — the target has exactly two classes (e.g. "survived" vs. "not survived", "pass" vs. "fail").
  2. Multinomial — the target has three or more unordered classes (e.g. "disease A" vs. "disease B" vs. "disease C").
  3. Ordinal — the target has three or more ordered classes (e.g. "poor", "good", "very good").

This tutorial focuses on the binomial case, where the target variable takes the values 0 or 1.

The Sigmoid Function

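The sigmoid function maps any real-valued number to a value in the open interval (0, 1). For an input $z$, it is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where:

  • $z$: the linear combination of the input features and the model weights
  • $e$: Euler's number (~2.718)
  • $\sigma(z)$: the predicted probability that $y = 1$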

The plot below shows the S-shaped curve of the sigmoid function. Notice how the output approaches 0 for very negative inputs and approaches 1 for very positive inputs.

Sigmoid function plot showing σ(z) = 1/(1+e^−z) mapping any real input z to a probability between 0 and 1

Near either extreme of the curve, changes in $z$ produce very little change in $\sigma(z)$. This saturation is the source of the well-known vanishing gradient problem: gradients become very small, which can slow or stall learning. In practice, this is manageable for logistic regression, and sigmoid remains the standard activation for binary output layers.
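To make the flattening at the extremes concrete, here is a minimal NumPy sketch that evaluates the sigmoid and its derivative, $\sigma(z)(1 - \sigma(z))$, at a few inputs. The helper names `sigmoid` and `sigmoid_grad` are illustrative and not part of scikit-learn.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
probs = sigmoid(z)
grads = sigmoid_grad(z)

# The gradient peaks at z = 0 and shrinks towards zero at both extremes
for zi, p, g in zip(z, probs, grads):
    print(f"z={zi:6.1f}  sigma(z)={p:.5f}  gradient={g:.5f}")
```

The gradient is largest at $z = 0$, where it equals 0.25, and is close to zero at $z = \pm 10$.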

The Decision Boundary

Once the sigmoid function produces a probability, you apply a decision threshold to assign a class label. With the default threshold of 0.5, any input whose predicted probability exceeds 0.5 is classified as class 1; otherwise it is classified as class 0. The decision boundary is the hyperplane in feature space where the model's output equals exactly 0.5.

The scatter plot below illustrates a decision boundary separating two classes in a two-dimensional feature space.

Scatter plot of two binary classes (y=0 in green, y=1 in dark red) separated by a diagonal decision boundary line

Choosing the right threshold depends on your problem: a fraud-detection model might lower the threshold to catch more true positives at the cost of more false positives.
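As a small illustration of how the threshold changes the labels, the sketch below applies two different thresholds to the same probabilities. The `classify` helper and the probability values are made up for this example.

```python
import numpy as np

def classify(probabilities, threshold=0.5):
    """Return 1 where the probability exceeds the threshold, else 0."""
    return (np.asarray(probabilities) > threshold).astype(int)

# Hypothetical predicted probabilities for five inputs
probs = np.array([0.10, 0.35, 0.50, 0.65, 0.90])

default_labels = classify(probs)        # default threshold of 0.5
lenient_labels = classify(probs, 0.3)   # lower threshold flags more positives

print(default_labels)
print(lenient_labels)
```

Lowering the threshold from 0.5 to 0.3 moves two borderline inputs into class 1, which is the mechanism behind catching more true positives at the cost of more false positives.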

The Cost Function

In linear regression, the mean squared error cost function produces a convex surface, which is easy to optimise. If you apply MSE directly to logistic regression, the result is non-convex — full of local minima that make gradient descent unreliable. Instead, logistic regression uses log loss (also called binary cross-entropy):

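$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(\hat{y}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]$$

Where:

  • $m$: number of training examples
  • $y^{(i)}$: true label for example $i$ (0 or 1)
  • $\hat{y}^{(i)}$: predicted probability for example $i$
  • $\theta$: model parameters (weights)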

When the true label is 1, only the first term contributes; when the true label is 0, only the second term contributes. This formulation produces a convex surface that gradient descent can minimise reliably.
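The following sketch computes binary cross-entropy by hand so you can see how the two terms behave. The function name `binary_log_loss`, the clipping constant, and the example values are assumptions made for illustration; in practice you would use `sklearn.metrics.log_loss`.

```python
import numpy as np

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy over all examples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    per_example = y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
    return -per_example.mean()

# Made-up labels and predictions
y_true = [1, 0, 1, 0]
confident_right = [0.9, 0.1, 0.9, 0.1]
confident_wrong = [0.1, 0.9, 0.1, 0.9]

loss_right = binary_log_loss(y_true, confident_right)
loss_wrong = binary_log_loss(y_true, confident_wrong)
print(loss_right, loss_wrong)
```

Confident correct predictions give a loss of about 0.105, while confident wrong predictions give about 2.303, which shows how heavily log loss penalises being confidently wrong.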

Loading the Titanic Dataset

The Titanic dataset describes 891 passengers from the 1912 disaster. Your task is to predict the survived column (1 = survived, 0 = did not survive). Start by importing all required libraries and loading the dataset from seaborn's built-in collection.

PYTHON
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import RFE
from sklearn.metrics import (
    accuracy_score, classification_report, precision_score, recall_score,
    confusion_matrix, precision_recall_curve, roc_auc_score, roc_curve, auc, log_loss
)
%matplotlib inline

Load the dataset and inspect the first ten rows:

PYTHON
titanic = sns.load_dataset('titanic')
titanic.head(10)
OUTPUT
   survived  pclass     sex   age  sibsp  parch     fare  embarked   class    who  adult_male  deck  embark_town  alive  alone
0         0       3    male  22.0      1      0   7.2500         S   Third    man        True   NaN  Southampton     no  False
1         1       1  female  38.0      1      0  71.2833         C   First  woman       False     C    Cherbourg    yes  False
2         1       3  female  26.0      0      0   7.9250         S   Third  woman       False   NaN  Southampton    yes   True
3         1       1  female  35.0      1      0  53.1000         S   First  woman       False     C  Southampton    yes  False
4         0       3    male  35.0      0      0   8.0500         S   Third    man        True   NaN  Southampton     no   True
5         0       3    male   NaN      0      0   8.4583         Q   Third    man        True   NaN   Queenstown     no   True
6         0       1    male  54.0      0      0  51.8625         S   First    man        True     E  Southampton     no   True
7         0       3    male   2.0      3      1  21.0750         S   Third  child       False   NaN  Southampton     no  False
8         1       3  female  27.0      0      2  11.1333         S   Third  woman       False   NaN  Southampton    yes  False
9         1       2  female  14.0      1      0  30.0708         C  Second  child       False   NaN    Cherbourg    yes  False

Check the summary statistics for all numeric columns:

PYTHON
titanic.describe()
OUTPUT
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Exploratory Data Analysis

Before training any model, you need to understand the data: which columns have missing values, how the age distribution looks, and how survival rates vary by passenger characteristics.

Checking for Missing Values

Identify how many null values exist in each column:

PYTHON
titanic.isnull().sum()
OUTPUT
survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

Four columns have missing values: age (177 rows), embarked and embark_town (2 rows each), and deck (688 rows, about 77 % of passengers). The heatmap below makes this pattern immediately visible.

PYTHON
sns.heatmap(titanic.isnull(), cbar = False, cmap = 'viridis')
plt.title('Missing values by column')
plt.show()

Viridis heatmap of the Titanic dataset showing missing values as bright stripes, with the deck column almost entirely missing

Calculate the exact percentage of missing age values:

PYTHON
titanic['age'].isnull().sum()/titanic.shape[0]*100
OUTPUT
19.865319865319865

About 20 % of passengers have no recorded age, so you will fill these gaps using class-based mean imputation rather than dropping the rows.

Age Distribution

Plot the density-normalised histogram of passenger ages to understand the overall distribution before imputation.

PYTHON
ax = titanic['age'].hist(bins = 30, density = True, stacked = True, color = 'teal', alpha = 0.7, figsize = (16, 5))
titanic['age'].plot(kind = 'density', color = 'teal')
ax.set_xlabel('Age')
plt.title('Density of passenger ages')
plt.show()

Teal histogram with a density curve showing the Titanic passenger age distribution, peaking around 20–30 years old

The distribution is right-skewed, with most passengers in their twenties and thirties. Now break that down by sex and survival outcome.

Survival by Age and Sex

The two side-by-side histograms below show how survival rates differ between female and male passengers across different age groups.

PYTHON
survived = 'survived'
not_survived = 'not survived'

fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (20, 4))
women = titanic[titanic['sex'] == 'female']
men = titanic[titanic['sex'] == 'male']

# histplot replaces distplot, which is deprecated in recent seaborn releases
ax = sns.histplot(women[women[survived]==1].age.dropna(), bins = 18, label = survived, ax = axes[0], color = 'tab:blue')
ax = sns.histplot(women[women[survived]==0].age.dropna(), bins = 40, label = not_survived, ax = axes[0], color = 'tab:orange')
ax.legend()
ax.set_title('Female passengers by age and survival')
ax.set_ylabel('No. of people')
ax = sns.histplot(men[men[survived]==1].age.dropna(), bins = 18, label = survived, ax = axes[1], color = 'tab:blue')
ax = sns.histplot(men[men[survived]==0].age.dropna(), bins = 40, label = not_survived, ax = axes[1], color = 'tab:orange')
ax.legend()
ax.set_title('Male passengers by age and survival')
ax.set_ylabel('No. of people')
plt.show()

Side-by-side age distribution histograms for female passengers (left) and male passengers (right), with survived and not-survived groups overlaid

Female passengers survived at much higher rates across all age groups, while male survival was relatively low regardless of age. This confirms that sex will be an important feature in the model.

Check the sex counts in the dataset:

PYTHON
titanic['sex'].value_counts()
OUTPUT
male      577
female    314
Name: sex, dtype: int64

Age and Fare by Passenger Class

Inspect how age and fare vary across the three passenger classes using box plots. First, look at age distribution by class:

PYTHON
sns.catplot(x = 'pclass', y = 'age', data = titanic, kind = 'box')
plt.title('Age distribution by passenger class')
plt.show()

Box plot showing age distribution for each of the three passenger classes, with first-class passengers tending to be older

First-class passengers tend to be older on average than second- or third-class passengers. Now check fare distribution by class:

PYTHON
sns.catplot(x = 'pclass', y = 'fare', data = titanic, kind = 'box')
plt.title('Fare distribution by passenger class')
plt.show()

Box plot showing fare distribution by passenger class, with first-class fares far higher and more spread out than second and third class

First-class fares are substantially higher and more variable. Because age also differs by class, as the earlier box plot showed, you will use class-based mean ages for imputation.

Compute the mean age for each passenger class:

PYTHON
titanic[titanic['pclass'] == 1]['age'].mean()
OUTPUT
38.233440860215055
PYTHON
titanic[titanic['pclass'] == 2]['age'].mean()
OUTPUT
29.87763005780347
PYTHON
titanic[titanic['pclass'] == 3]['age'].mean()
OUTPUT
25.14061971830986

Data Preprocessing

With a clear picture of the missing values and distributions, you can now clean the dataset. The steps are: impute missing ages, fill the two missing embark values, drop columns with too many missing values, and encode categorical variables as integers.

Imputing Missing Ages

Imputation replaces missing values with a computed substitute. Rather than using the global mean age, you use the mean age for each passenger class — a more accurate estimate given the age differences you saw above.

Define an imputation function that checks the passenger's class and returns the appropriate mean age:

PYTHON
# Mean age per passenger class, computed once from the recorded ages
class_mean_age = titanic.groupby('pclass')['age'].mean()

def impute_age(row):
    # Keep a recorded age; otherwise fall back to the mean age of the passenger's class
    if pd.isnull(row['age']):
        return class_mean_age[int(row['pclass'])]
    return row['age']

Apply the function to fill all missing ages:

PYTHON
titanic['age'] = titanic[['age', 'pclass']].apply(impute_age, axis = 1)
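
A vectorised alternative avoids the row-by-row `apply` call. The sketch below uses a small made-up frame (`demo`, with invented values) so that it runs on its own; the same two lines should carry over to the `titanic` frame, although that is not verified here.

```python
import numpy as np
import pandas as pd

# Made-up miniature frame standing in for the Titanic data
demo = pd.DataFrame({
    'pclass': [1, 1, 1, 3, 3, 3],
    'age':    [40.0, 36.0, np.nan, 20.0, np.nan, 30.0],
})

# Mean age of each row's own class, aligned to the original index
class_mean = demo.groupby('pclass')['age'].transform('mean')

# Keep recorded ages and fill only the missing ones
demo['age'] = demo['age'].fillna(class_mean)
print(demo)
```

Here the missing first-class age becomes 38.0 and the missing third-class age becomes 25.0, the means of the recorded ages in each class.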

Re-plot the missing-value heatmap to confirm that the age column is now fully populated:

PYTHON
sns.heatmap(titanic.isnull(), cbar = False, cmap = 'viridis')
plt.title('Missing values by column after age imputation')
plt.show()

Viridis heatmap after age imputation showing the age column now fully filled, while deck and embark_town still have missing values

The age column is clean. The deck column still has nearly all values missing, so you will drop it entirely later.

Analysing Embark Port

Before dropping columns, investigate the embarkation data to understand survival patterns by port. A FacetGrid creates a grid of subplots conditioned on a categorical variable — here, embarkation port.

PYTHON
f = sns.FacetGrid(titanic, row = 'embarked', height = 2.5, aspect= 3)
f.map(sns.pointplot, 'pclass', 'survived', 'sex', order = [1, 2, 3], hue_order = ['male', 'female'])
f.add_legend()
plt.show()

FacetGrid with three rows (embarked = S, C, Q) showing survival rates by passenger class and sex for each embarkation port

Female passengers consistently show higher survival rates than males across all ports and classes. Check how many rows have missing embark values:

PYTHON
titanic['embarked'].isnull().sum()
OUTPUT
2
PYTHON
titanic['embark_town'].value_counts()
OUTPUT
Southampton    644
Cherbourg      168
Queenstown      77
Name: embark_town, dtype: int64

Southampton ('S') is by far the most common embark port, so fill the two missing values with 'S':

PYTHON
common_value = 'S'
titanic['embarked'] = titanic['embarked'].fillna(common_value)
titanic['embarked'].isnull().sum()
OUTPUT
0

Confirm the heatmap now shows embarked fully filled:

PYTHON
sns.heatmap(titanic.isnull(), cbar = False, cmap = 'viridis')
plt.title('Missing values by column after filling embarked')
plt.show()

Viridis heatmap after filling missing embarked values, showing embarked now fully populated while deck and embark_town still have gaps

Dropping Redundant Columns

Remove deck, embark_town, and alive: deck has too many missing values to be useful, embark_town is a duplicate of embarked, and alive is a direct text encoding of the target variable survived. Use drop() to remove them:

PYTHON
titanic.drop(labels=['deck', 'embark_town', 'alive'], inplace = True, axis = 1)

Re-check the heatmap to confirm all remaining columns are fully populated:

PYTHON
sns.heatmap(titanic.isnull(), cbar = False, cmap = 'viridis')
plt.title('Missing values by column after dropping columns')
plt.show()

Viridis heatmap after dropping deck, embark_town, and alive columns — all remaining columns show no missing values

Every remaining column is now complete. Check the full info() summary to verify column types:

PYTHON
titanic.info()
OUTPUT
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   survived    891 non-null    int64
 1   pclass      891 non-null    int64
 2   sex         891 non-null    object
 3   age         891 non-null    float64
 4   sibsp       891 non-null    int64
 5   parch       891 non-null    int64
 6   fare        891 non-null    float64
 7   embarked    891 non-null    object
 8   class       891 non-null    category
 9   who         891 non-null    object
 10  adult_male  891 non-null    bool
 11  alone       891 non-null    bool
dtypes: bool(2), category(1), float64(2), int64(4), object(3)
memory usage: 65.5+ KB

Inspect the first few rows before encoding:

PYTHON
titanic.head()
OUTPUT
   survived  pclass     sex   age  sibsp  parch     fare  embarked  class    who  adult_male  alone
0         0       3    male  22.0      1      0   7.2500         S  Third    man        True  False
1         1       1  female  38.0      1      0  71.2833         C  First  woman       False  False
2         1       3  female  26.0      0      0   7.9250         S  Third  woman       False   True
3         1       1  female  35.0      1      0  53.1000         S  First  woman       False  False
4         0       3    male  35.0      0      0   8.0500         S  Third    man        True   True

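Cast fare, age, and pclass to integer. This truncates the fractional part of fare and age (for example, a fare of 7.25 becomes 7) and slightly reduces memory usage: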

PYTHON
titanic['fare'] = titanic['fare'].astype('int')
titanic['age'] = titanic['age'].astype('int')
titanic['pclass'] = titanic['pclass'].astype('int')
titanic.info()
OUTPUT
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   survived    891 non-null    int64
 1   pclass      891 non-null    int32
 2   sex         891 non-null    object
 3   age         891 non-null    int32
 4   sibsp       891 non-null    int64
 5   parch       891 non-null    int64
 6   fare        891 non-null    int32
 7   embarked    891 non-null    object
 8   class       891 non-null    category
 9   who         891 non-null    object
 10  adult_male  891 non-null    bool
 11  alone       891 non-null    bool
dtypes: bool(2), category(1), int32(3), int64(3), object(3)
memory usage: 55.0+ KB

Converting Categorical Data to Numbers

Scikit-learn requires all features to be numeric. Map each categorical column to integer codes using dictionaries, then inspect the result.

PYTHON
genders = {'male': 0, 'female': 1}
titanic['sex'] = titanic['sex'].map(genders)
# The key must be 'woman' to match the data; a misspelt key maps those rows to NaN
who = {'man': 0, 'woman': 1, 'child': 2}
titanic['who'] = titanic['who'].map(who)
adult_male = {True: 1, False: 0}
titanic['adult_male'] = titanic['adult_male'].map(adult_male)
alone = {True: 1, False: 0}
titanic['alone'] = titanic['alone'].map(alone)
ports = {'S': 0, 'C': 1, 'Q': 2}
titanic['embarked'] = titanic['embarked'].map(ports)
titanic.head()
OUTPUT
   survived  pclass  sex  age  sibsp  parch  fare  embarked  class  who  adult_male  alone
0         0       3    0   22      1      0     7         0  Third    0           1      0
1         1       1    1   38      1      0    71         1  First    1           0      0
2         1       3    1   26      0      0     7         0  Third    1           0      1
3         1       1    1   35      1      0    53         0  First    1           0      0
4         0       3    0   35      0      0     8         0  Third    0           1      1

Drop the class and who columns, which duplicate information already carried by pclass, sex, and adult_male:

PYTHON
titanic.drop(labels = ['class', 'who'], axis = 1, inplace= True)
titanic.head()
OUTPUT
   survived  pclass  sex  age  sibsp  parch  fare  embarked  adult_male  alone
0         0       3    0   22      1      0     7         0           1      0
1         1       1    1   38      1      0    71         1           0      0
2         1       3    1   26      0      0     7         0           0      1
3         1       1    1   35      1      0    53         0           0      0
4         0       3    0   35      0      0     8         0           1      1

You now have a clean, fully numeric 10-column dataset ready for modelling.

Training a Baseline Logistic Regression Model

Split the data and fit a first logistic regression model using all nine features. This gives you a baseline accuracy to compare against after feature selection.

PYTHON
X = titanic.drop('survived', axis = 1)
y = titanic['survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state = 42)
model = LogisticRegression(solver= 'lbfgs', max_iter = 400)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
model.score(X_test, y_test)
OUTPUT
0.8271186440677966

The baseline model achieves 82.7 % accuracy. Next you will use feature selection to see whether a smaller, better-chosen feature set improves that score.

Recursive Feature Elimination

Recursive Feature Elimination (RFE) is a feature selection technique that works by repeatedly fitting a model, ranking features by their importance, and pruning the least important one at each step. It continues until the target number of features is reached.

The algorithm below illustrates the RFE procedure at each iteration:

Pseudocode for Recursive Feature Elimination showing the iterative steps of training, ranking, and pruning features
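
The loop described above can be sketched by hand as follows. This is a simplified illustration rather than scikit-learn's implementation: it ranks features by the absolute value of the logistic regression coefficients, uses synthetic data from `make_classification` in place of the Titanic features, and the function name `manual_rfe` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for the Titanic features (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

def manual_rfe(X, y, n_features_to_select):
    """Drop the feature with the smallest |coefficient| until the target count remains."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        model = LogisticRegression(max_iter=1000)
        model.fit(X[:, remaining], y)
        importance = np.abs(model.coef_[0])   # one weight per remaining feature
        weakest = int(np.argmin(importance))  # position within `remaining`
        del remaining[weakest]
    return remaining

selected = manual_rfe(X_demo, y_demo, n_features_to_select=3)
print("Kept feature indices:", selected)
```

Because coefficient magnitude depends on each feature's scale, this kind of ranking is most meaningful when the features are on comparable scales.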

Fit an RFE wrapper around a LogisticRegression estimator, targeting the top 5 features:

PYTHON
model = LogisticRegression(solver='lbfgs', max_iter=500)
# n_features_to_select must be passed by keyword in current scikit-learn releases
rfe = RFE(model, n_features_to_select=5, verbose=1)
rfe = rfe.fit(X, y)
rfe.support_
OUTPUT
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.

array([ True, False, False,  True,  True, False, False,  True,  True])

The boolean array marks which of the nine columns were selected. Inspect the full dataset and the feature matrix to identify which column positions are True:

PYTHON
titanic.head(3)
OUTPUT
   survived  pclass  sex  age  sibsp  parch  fare  embarked  adult_male  alone
0         0       3    0   22      1      0     7         0           1      0
1         1       1    1   38      1      0    71         1           0      0
2         1       3    1   26      0      0     7         0           0      1
PYTHON
X.head()
OUTPUT
   pclass  sex  age  sibsp  parch  fare  embarked  adult_male  alone
0       3    0   22      1      0     7         0           1      0
1       1    1   38      1      0    71         1           0      0
2       3    1   26      0      0     7         0           0      1
3       1    1   35      1      0    53         0           0      0
4       3    0   35      0      0     8         0           1      1

Filter the feature matrix down to only the RFE-selected columns:

PYTHON
XX = X[X.columns[rfe.support_]]
PYTHON
XX.head()
OUTPUT
   pclass  sibsp  parch  adult_male  alone
0       3      1      0           1      0
1       1      1      0           0      0
2       3      0      0           0      1
3       1      1      0           0      0
4       3      0      0           1      1

RFE selected pclass, sibsp, parch, adult_male, and alone as the five most predictive features. Re-split using these columns only, then re-train:

PYTHON
X_train, X_test, y_train, y_test = train_test_split(XX, y, test_size = 0.2, random_state = 8, stratify = y)
PYTHON
model = LogisticRegression(solver= 'lbfgs', max_iter = 500)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
model.score(X_test, y_test)
OUTPUT
0.8547486033519553

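The model with just five features achieves 85.5 % accuracy, compared with 82.7 % for the baseline with all nine features. Treat the size of the gain with some caution: the two scores come from different train/test splits (different test_size and random_state, and only the second split is stratified), and RFE was fitted on the full dataset, so the test rows influenced which features were selected. What the result does show is that five features are enough to match the baseline on this data.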

Evaluating Classification Metrics

Beyond raw accuracy, good classification evaluation requires precision, recall, log loss, and the ROC-AUC score. The diagram below defines the key metrics in terms of the confusion matrix quadrants.

Diagram showing the formulas for Precision, Recall, and Accuracy alongside a 2×2 confusion matrix grid labelled True Positive, False Positive, False Negative, True Negative

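  • Accuracy, the fraction of all predictions that are correct: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
  • Precision, the fraction of positive predictions that are truly positive: $\text{Precision} = \frac{TP}{TP + FP}$
  • Recall (sensitivity), the fraction of actual positives that are correctly identified: $\text{Recall} = \frac{TP}{TP + FN}$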
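
As a worked example of these three metrics, the sketch below counts the four confusion-matrix cells by hand on ten made-up predictions. The label arrays are invented for illustration and are unrelated to the Titanic results.

```python
import numpy as np

# Hypothetical labels and predictions, invented for illustration
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, fp, fn, tn)
print(accuracy, precision, recall)
```

With 3 true positives, 1 false positive, 1 false negative, and 5 true negatives, accuracy is 0.8 while precision and recall are both 0.75.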

Re-fit the model and generate probability scores for the test set:

PYTHON
model = LogisticRegression(solver= 'lbfgs', max_iter = 500)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

Predicting Probabilities

The predict_proba() method returns a two-column array: the first column holds $P(y = 0 \mid x)$ and the second holds $P(y = 1 \mid x)$. Index [:, 1] to extract the positive-class probabilities:

PYTHON
y_predict_prob = model.predict_proba(X_test)[:, 1]
y_predict_prob[: 5]
OUTPUT
array([0.55566832, 0.87213996, 0.09376084, 0.09376084, 0.37996908])

These probabilities feed directly into the ROC curve calculation.

Computing the ROC Curve

The Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (recall) against the False Positive Rate at every possible decision threshold. roc_curve() returns the false positive rates, the true positive rates, and the corresponding threshold values. The two rate arrays are in increasing order, while the thresholds are in decreasing order:

PYTHON
fpr, tpr, thr = roc_curve(y_test, y_predict_prob)
[fpr, tpr, thr][: 2]
OUTPUT
[array([0.        , 0.        , 0.        , 0.        , 0.00909091, 0.00909091, 0.00909091, 0.00909091, 0.00909091, 0.00909091, 0.03636364, 0.03636364, 0.03636364, 0.06363636, 0.09090909, 0.12727273, 0.12727273, 0.13636364, 0.21818182, 0.23636364, 0.24545455, 0.27272727, 0.29090909, 0.43636364, 0.45454545, 0.47272727, 0.52727273, 0.92727273, 1.        ]), array([0.        , 0.07246377, 0.20289855, 0.24637681, 0.33333333, 0.39130435, 0.44927536, 0.55072464, 0.60869565, 0.63768116, 0.63768116, 0.65217391, 0.69565217, 0.7826087 , 0.7826087, 0.7826087 , 0.79710145, 0.79710145, 0.86956522, 0.88405797, 0.88405797, 0.88405797, 0.88405797, 0.89855072, 0.91304348, 0.91304348, 0.92753623, 1.        , 1.        ])]

Accuracy, Log Loss, and AUC

Compute the three summary metrics for the model:

  • accuracy_score() — fraction of correct predictions
  • log_loss() — negative log-likelihood; lower is better
  • auc() — area under the ROC curve using the trapezoidal rule; 1.0 is a perfect classifier
PYTHON
print('Accuracy: ', accuracy_score(y_test, y_predict))
print('log loss: ', log_loss(y_test, y_predict_prob))
print('auc: ', auc(fpr, tpr))
OUTPUT
Accuracy:  0.8547486033519553
log loss:  0.36597373727139876
auc:  0.9007246376811595

An AUC of 0.90 is strong — it means your model correctly ranks a randomly chosen survivor above a randomly chosen non-survivor 90 % of the time.
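
The ranking interpretation can be checked directly by comparing every positive example with every negative one. The sketch below does this on six made-up scores; the helper name `pairwise_auc` and the data are assumptions for illustration, and ties are counted as half a win.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# Made-up labels and scores for illustration
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

auc_value = pairwise_auc(y_true, scores)
print(auc_value)
```

Eight of the nine positive/negative pairs are ordered correctly, so the result is 8/9, or about 0.889.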

Plotting the ROC Curve

Find the index where the true positive rate first exceeds 0.95, then draw reference lines at that operating point:

PYTHON
idx = np.min(np.where(tpr>0.95))

Plot the full ROC curve with the operating point highlighted:

PYTHON
plt.figure()
plt.plot(fpr, tpr, color = 'coral', label = "ROC curve area: " + str(auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], 'k--')
plt.plot([0, fpr[idx]], [tpr[idx], tpr[idx]], '--', color = 'blue')
plt.plot([fpr[idx],fpr[idx]], [0,tpr[idx]], '--', color='blue')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - specificity)', fontsize=14)
plt.ylabel('True Positive Rate (recall)', fontsize=14)
plt.title('Receiver operating characteristic (ROC) curve')
plt.legend(loc="lower right")
plt.show()
print("Using a threshold of %.3f " % thr[idx] + "guarantees a sensitivity of %.3f " % tpr[idx] +
      "and a specificity of %.3f" % (1-fpr[idx]) +
      ", i.e. a false positive rate of %.2f%%." % (np.array(fpr[idx])*100))

ROC curve for the Titanic logistic regression model showing a coral curve well above the diagonal baseline with an AUC of approximately 0.90

OUTPUT
Using a threshold of 0.094 guarantees a sensitivity of 1.000 and a specificity of 0.073, i.e. a false positive rate of 92.73%.

At a threshold of 0.094 the model catches every survivor (recall = 1.0), but at the cost of a very high false positive rate: 92.7 % of non-survivors are also flagged as survivors. This illustrates the classic sensitivity/specificity trade-off: lowering the threshold increases recall but decreases specificity. For most applications you would choose a threshold that balances the two.
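
One common way to pick a balanced threshold, which the tutorial itself does not use, is Youden's J statistic: choose the point on the ROC curve that maximises tpr − fpr. The sketch below applies it to made-up arrays shaped like the output of roc_curve(); the values are invented for illustration.

```python
import numpy as np

# Made-up ROC points shaped like roc_curve() output (thresholds decreasing);
# the first threshold is a sentinel above every score
fpr = np.array([0.0, 0.0, 0.1, 0.2, 0.5, 1.0])
tpr = np.array([0.0, 0.4, 0.7, 0.9, 0.95, 1.0])
thr = np.array([np.inf, 0.9, 0.7, 0.5, 0.3, 0.1])

# Youden's J statistic: sensitivity + specificity - 1 = tpr - fpr
j_scores = tpr - fpr
best = int(np.argmax(j_scores))

print(f"threshold={thr[best]:.2f}  sensitivity={tpr[best]:.2f}  specificity={1 - fpr[best]:.2f}")
```

On these values the best operating point is the threshold of 0.5, with a sensitivity of 0.90 and a specificity of 0.80. Other criteria, such as a cost-weighted trade-off, may suit a given application better.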

Conclusion

In this tutorial you built a logistic regression classifier to predict Titanic survival. You started from a raw dataset with missing values and mixed data types, cleaned and imputed the data, applied Recursive Feature Elimination to select the five most informative features, and evaluated the final model with accuracy, log loss, and the ROC-AUC curve. The final model reached 85.5 % accuracy and an AUC of 0.90 — a strong result for a linear, interpretable algorithm.

Key takeaways:

  • Logistic regression squashes a linear score through the sigmoid function to produce a probability bounded between 0 and 1.
  • The decision threshold is a hyperparameter: lowering it increases recall at the cost of more false positives.
  • Log loss is a more informative training objective than accuracy because it penalises confident wrong predictions.
  • Recursive Feature Elimination can improve generalisation by removing noisy or redundant features.
  • An AUC of 0.90 means the model ranks positive examples above negative examples 90 % of the time, regardless of threshold choice.

Next steps:

  • Explore Support Vector Machines to see how a margin-based classifier handles the same binary classification problem.
  • Read Random Forest Classifier and Regressor to understand how an ensemble of decision trees can outperform a single linear model.
  • Try adjusting the C regularisation parameter in LogisticRegression to control the bias–variance trade-off and observe how accuracy and AUC change on the held-out test set.
