
Multi-Label Text Classification on Stack Overflow Tag Prediction

Build a multi-label text classifier that predicts Stack Overflow tags from question text using TF-IDF vectorization and the OneVsRest strategy, then evaluate it with Hamming loss and Jaccard score.

May 20, 2026 at 8:15 AM · 11 min read

Topics You Will Master

  • The difference between binary, multi-class, and multi-label classification
  • How to encode multi-tag labels using MultiLabelBinarizer
  • How to convert question text into numbers using TF-IDF vectorization
  • How to train a multi-label model with OneVsRestClassifier
  • How to measure multi-label quality with Hamming loss and Jaccard score

Best For

Python developers and data scientists who know basic supervised learning and want to build a text classifier that can assign several labels to a single document at the same time.

Expected Outcome

A trained tag-prediction model that reads a Stack Overflow question and returns multiple relevant programming tags (for example, python, sql, c#), evaluated with Jaccard score so you can compare classifiers fairly.

Every Stack Overflow question can carry several tags at once — a question about reading a CSV file in Python might be tagged python, pandas, and csv simultaneously. A standard classifier that predicts only one label cannot handle this. You need multi-label classification — a technique where a model predicts any number of labels for each input.

In this tutorial you will build a tag-suggestion system for Stack Overflow questions. You will load a real dataset of 48,976 questions, clean and encode the tags, turn question text into TF-IDF vectors, and train three classifiers (SGDClassifier, LogisticRegression, and LinearSVC) wrapped in the OneVsRestClassifier strategy. You will evaluate each model using Jaccard score.

Prerequisites: Python 3.x, scikit-learn, Pandas, NumPy.

Classification Task Types

Before diving into the code, it helps to understand the four main types of classification tasks, because multi-label is one specific variant.

Binary classification — the output is one of exactly two classes.

  • Email spam detection (spam or not spam)
  • Churn prediction (will churn or will not)
  • Conversion prediction (buy or not buy)

Multi-class classification — the output is one of more than two classes, but still only one class per sample.

  • Identifying a fruit as apple, pear, or banana — it can only be one at a time.

Multi-label classification — the output can be any combination of classes. One or more labels may apply to the same sample simultaneously.

  • A photo of a person riding a bicycle near a tree could be tagged person, bicycle, and tree all at once.
  • A Stack Overflow question about reading files in Python could be tagged python and pandas together.

Imbalanced classification — the classes in the dataset are not evenly distributed. Fraud detection and outlier detection are typical examples.

The diagram below shows the structural difference between binary, multi-class, and multi-label tasks — notice that multi-label allows multiple checkmarks on the same sample:

[Figure: binary, multi-class, and multi-label classification compared. Binary picks one of two labels, multi-class picks one of many, multi-label picks all that apply.]
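The three task types also differ in the shape of the target array. The sketch below uses made-up targets (the label names person, bicycle, and tree come from the photo example above) and scikit-learn's type_of_target utility to show how each shape is recognised:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Binary: one value per sample, drawn from exactly two classes.
y_binary = np.array([0, 1, 1, 0])

# Multi-class: still one value per sample, but more than two classes.
y_multiclass = np.array([0, 2, 1, 2])

# Multi-label: one row per sample, one 0/1 column per label
# (columns here stand for person, bicycle, tree).
y_multilabel = np.array([
    [1, 1, 1],   # all three labels apply
    [1, 0, 0],   # only "person"
    [0, 0, 1],   # only "tree"
    [0, 0, 0],   # no label applies
])

for name, y in [("binary", y_binary),
                ("multi-class", y_multiclass),
                ("multi-label", y_multilabel)]:
    print(name, y.shape, type_of_target(y))
```

The multi-label target is the only one that is two-dimensional, which is why it needs its own encoder and its own metrics later in the tutorial.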

Notebook Setup

Import all the libraries you will need for this tutorial in one block:

PYTHON
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split

from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

from sklearn.multiclass import OneVsRestClassifier

Load the Stack Overflow dataset directly from a public CSV. The index_col=0 argument tells Pandas to use the first column as the row index:

PYTHON
df = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/stackoverflow.csv', index_col=0)

Inspect the first five rows to understand the structure:

PYTHON
df.head()
OUTPUT
   Text                                               Tags
2  aspnet site maps has anyone got experience cre...  ['sql', 'asp.net']
4  adding scripting functionality to net applicat...  ['c#', '.net']
5  should i use nested classes in this case i am ...  ['c++']
6  homegrown consumption of web services i have b...  ['.net']
8  automatically update version number i would li...  ['c#']

The Tags column looks like a list, but check its actual type:

PYTHON
type(df['Tags'].iloc[0])
OUTPUT
str

It is a plain string, not a Python list. Confirm what the raw string looks like:

PYTHON
df['Tags'].iloc[0]
OUTPUT
"['sql', 'asp.net']"

Parsing the Tag Strings

Because each tag cell is stored as a string representation of a list, you need to convert it to a real Python list before encoding. The ast module — part of the Python standard library — provides literal_eval, which safely parses a string that looks like a Python literal (list, dict, number, etc.) into the corresponding Python object.

Use ast.literal_eval on one row to verify it works:

PYTHON
import ast
ast.literal_eval(df['Tags'].iloc[0])
OUTPUT
['sql', 'asp.net']

Apply it to the entire Tags column with lambda:

PYTHON
df['Tags'] = df['Tags'].apply(lambda x: ast.literal_eval(x))
df.head()
OUTPUT
   Text                                               Tags
2  aspnet site maps has anyone got experience cre...  [sql, asp.net]
4  adding scripting functionality to net applicat...  [c#, .net]
5  should i use nested classes in this case i am ...  [c++]
6  homegrown consumption of web services i have b...  [.net]
8  automatically update version number i would li...  [c#]

The tags are now real Python lists, ready for encoding.
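To see why literal_eval is described as safe, the sketch below parses a valid tag string and then feeds it two strings that are not plain literals. The parse_tags helper is hypothetical and not part of the tutorial's pipeline; it only illustrates one way to guard against malformed cells:

```python
import ast

raw = "['sql', 'asp.net']"
tags = ast.literal_eval(raw)   # a real list: ['sql', 'asp.net']


def parse_tags(cell):
    """Hypothetical helper: return the tags in `cell`, or [] if it is not a list literal."""
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        # literal_eval refuses function calls, names, and broken syntax
        # instead of executing them, unlike the built-in eval().
        return []
    return list(value) if isinstance(value, (list, tuple)) else []


print(parse_tags("['c#', '.net']"))              # parsed normally
print(parse_tags("__import__('os').getcwd()"))   # rejected, nothing is executed
print(parse_tags("not a list"))                  # rejected, invalid syntax
```

Whether to drop, log, or fix rejected cells depends on the dataset; returning an empty list is only one option.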

Encoding the Tag Labels

Machine learning models require numerical inputs. The tags are text labels, so you need to convert them to numbers. There are several standard encoding strategies — it is worth understanding each one before choosing the right tool for multi-label problems.

Label Encoding

LabelEncoder maps each unique text label to an integer (0, 1, 2, …), assigning the integers in sorted order of the labels. That is harmless for columns with only two values, such as Yes/No or Male/Female. For unordered columns with three or more categories it creates a false ordering: the model may interpret 0 < 1 < 2 as a numerical relationship between the classes, which is not correct.
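As a minimal sketch, here is LabelEncoder applied to a made-up list of food names. Note that scikit-learn numbers the classes from 0 in alphabetical order, so the integers carry no meaning beyond identity:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up example values, not taken from the Stack Overflow dataset.
foods = ["apple", "chicken", "broccoli", "apple"]

encoder = LabelEncoder()
codes = encoder.fit_transform(foods)

print(encoder.classes_)   # classes in sorted order
print(codes)              # position of each food in classes_

# The mapping can be reversed, but the ordering apple < broccoli < chicken
# is purely alphabetical and says nothing about the foods themselves.
print(encoder.inverse_transform([2]))
```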

One-Hot Encoding

OneHotEncoder converts a categorical column into one binary column per class. A value of 1 means the class is present; 0 means it is absent. The diagram below shows a food dataset converted from label encoding to one-hot encoding — each food name becomes its own column:

[Figure: table showing how Label Encoding assigns integers to food names (Apple=1, Chicken=2, Broccoli=3) while One Hot Encoding creates a separate binary column for each food.]

A known pitfall of one-hot encoding is the dummy variable trap, a case of perfect multicollinearity: if you know the values of all but one column, you can always infer the last. For example, knowing Apple=0 and Chicken=0 means Broccoli=1 is certain. Use the drop parameter of OneHotEncoder (for example, drop='first') to remove one redundant column.

The Pandas get_dummies() function does the same thing in one step. The drop_first=True argument drops the first dummy column to eliminate the variable trap automatically.
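The following sketch shows both routes on a made-up food column. The column name food and its values are illustrative assumptions; dtype=int is passed to get_dummies only so the output prints as 0/1 regardless of the Pandas version:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up example data.
foods = pd.DataFrame({"food": ["apple", "chicken", "broccoli", "apple"]})

# One column per category.
full = pd.get_dummies(foods["food"], dtype=int)

# drop_first=True removes the first (alphabetical) category, here "apple".
reduced = pd.get_dummies(foods["food"], drop_first=True, dtype=int)

# The scikit-learn equivalent; the result is sparse, so convert it for display.
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(foods[["food"]]).toarray()

print(full)
print(reduced)
print(encoded)
```

After dropping a column, a row of all zeros stands for the dropped category, so no information is lost.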

MultiLabel Binarizer

MultiLabelBinarizer is designed specifically for multi-label targets. It accepts a column where each cell is a list of labels and outputs a binary matrix — one column per unique label, with 1 indicating that label is present for a given row.

Extract the target column and confirm its shape before encoding:

PYTHON
y = df['Tags']
y
OUTPUT
2          [sql, asp.net]
4              [c#, .net]
5                   [c++]
6                  [.net]
8                    [c#]
                ...
1262668             [c++]
1262834             [c++]
1262915          [python]
1263065          [python]
1263454             [c++]
Name: Tags, Length: 48976, dtype: object

Fit the MultiLabelBinarizer and transform the tag lists into a binary array:

PYTHON
multilabel = MultiLabelBinarizer()
y = multilabel.fit_transform(df['Tags'])
y
OUTPUT
array([[0, 0, 1, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

Check which class each column represents:

PYTHON
multilabel.classes_
OUTPUT
array(['.net', 'android', 'asp.net', 'c', 'c#', 'c++', 'css', 'html',
       'ios', 'iphone', 'java', 'javascript', 'jquery', 'mysql',
       'objective-c', 'php', 'python', 'ruby', 'ruby-on-rails', 'sql'],
      dtype=object)

There are 20 unique tags in the dataset. View the full binary label matrix as a readable DataFrame:

PYTHON
pd.DataFrame(y, columns=multilabel.classes_)
OUTPUT
       .net  android  asp.net  c  c#  c++  css  html  ios  iphone  java  javascript  jquery  mysql  objective-c  php  python  ruby  ruby-on-rails  sql
0         0        0        1  0   0    0    0     0    0       0     0           0       0      0            0    0       0     0              0    1
1         1        0        0  0   1    0    0     0    0       0     0           0       0      0            0    0       0     0              0    0
2         0        0        0  0   0    1    0     0    0       0     0           0       0      0            0    0       0     0              0    0
3         1        0        0  0   0    0    0     0    0       0     0           0       0      0            0    0       0     0              0    0
4         0        0        0  0   1    0    0     0    0       0     0           0       0      0            0    0       0     0              0    0
...
48971     0        0        0  0   0    1    0     0    0       0     0           0       0      0            0    0       0     0              0    0
48972     0        0        0  0   0    1    0     0    0       0     0           0       0      0            0    0       0     0              0    0
48973     0        0        0  0   0    0    0     0    0       0     0           0       0      0            0    0       1     0              0    0
48974     0        0        0  0   0    0    0     0    0       0     0           0       0      0            0    0       1     0              0    0
48975     0        0        0  0   0    1    0     0    0       0     0           0       0      0            0    0       0     0              0    0

48976 rows × 20 columns

Each row is now a 20-element binary vector that tells the model which tags belong to that question.
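To make the encoding easy to check by hand, here is the same transformation on three toy tag lists (a small assumed sample, not the full dataset), followed by the reverse mapping:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three toy rows in the same format as the parsed Tags column.
toy_tags = [["sql", "asp.net"], ["c#", ".net"], ["c++"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(toy_tags)

print(mlb.classes_)   # column order: labels sorted as strings
print(Y)              # one row per question, one 0/1 column per label

# inverse_transform returns one tuple of labels per row,
# listed in column order rather than in the original order.
print(mlb.inverse_transform(Y))
```

Because the columns follow the sorted label order, the first row comes back as ('asp.net', 'sql') even though it was entered as ['sql', 'asp.net'].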

Text Vectorization with TF-IDF

Machine learning models cannot accept raw text — every word must be converted to a number first. The process of turning text into a numerical matrix is called text vectorization.

TF-IDF (Term Frequency–Inverse Document Frequency) is one of the most widely used vectorization methods for text classification. It assigns each word a score that reflects how important the word is to a specific document relative to the entire collection. A word that appears often in one document but rarely across all documents gets a high score; common words like "the" that appear everywhere get a low score.

The score is the product of two values:

$$\text{tf-idf}(t, d, D) = \text{tf}(t, d) \times \text{idf}(t, D)$$

Where:

  • $t$: the term (word or n-gram) being scored
  • $d$: the individual document (question) being scored
  • $D$: the full collection of documents (the entire dataset)
  • $\text{tf}(t, d)$: term frequency, how often term $t$ appears in document $d$
  • $\text{idf}(t, D)$: inverse document frequency, a penalty for terms that appear in many documents across $D$; common terms get a lower weight
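A worked example may help. The sketch below computes the score by hand on a made-up three-document corpus, using one common textbook definition: relative term frequency and idf = log(N / df). This is an assumption chosen for clarity. TfidfVectorizer uses raw counts, a smoothed idf, ln((1 + N) / (1 + df)) + 1, and L2-normalises each row by default, so its numbers will differ:

```python
import math

# Made-up corpus; each document is a list of tokens.
docs = [
    "python read csv file".split(),
    "python pandas read csv".split(),
    "java read file".split(),
]


def tf(term, doc):
    # Share of the document's tokens that are this term.
    return doc.count(term) / len(doc)


def idf(term, corpus):
    # Assumes the term occurs in at least one document.
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)


def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)


for term in ["read", "python", "pandas"]:
    print(term, round(tfidf(term, docs[1], docs), 4))
```

Here "read" appears in every document, so its idf and its score are zero, while "pandas" appears in only one document and scores highest.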

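Scikit-learn provides three text vectorizers:

  • CountVectorizer: counts how many times each term appears in each document.
  • TfidfVectorizer: computes the TF-IDF score for each term, downweighting terms that are common across documents.
  • HashingVectorizer: hashes each term to a column index in a fixed-size matrix and counts occurrences there. It stores no vocabulary, which saves memory, but the mapping is not reversible, so you cannot recover which term produced a given column.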
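The differences are easiest to see side by side. This sketch runs all three vectorizers on a made-up three-sentence corpus; n_features=16 is an arbitrary small value chosen for the demonstration:

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Made-up corpus.
corpus = ["python read csv file", "python pandas read csv", "java read file"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(corpus)     # integer term counts

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(corpus)    # TF-IDF weights

hash_vec = HashingVectorizer(n_features=16)
hashed = hash_vec.transform(corpus)          # stateless: no fit step needed

print(sorted(count_vec.vocabulary_))         # learned vocabulary
print(counts.shape, weights.shape, hashed.shape)

# "read" occurs in every document, so TF-IDF weights it below "csv".
vocab = tfidf_vec.vocabulary_
print(weights[0, vocab["read"]], weights[0, vocab["csv"]])
```

The count and TF-IDF matrices have one column per learned term, while the hashed matrix always has n_features columns whatever the corpus contains.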

Fit the TfidfVectorizer on the question text. The max_features=10000 argument limits the vocabulary to the 10 000 terms with the highest frequency across the corpus; ngram_range=(1,3) captures single words and phrases up to three words long; stop_words='english' drops common English words such as "the":

PYTHON
tfidf = TfidfVectorizer(analyzer='word', max_features=10000, ngram_range=(1,3), stop_words='english')
X = tfidf.fit_transform(df['Text'])
X.shape, y.shape
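To get a feel for what ngram_range and max_features do, the sketch below fits three vectorizers on a single made-up sentence and inspects the vocabulary each one learns:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sample = ["read csv file"]   # made-up one-document corpus

# Single words only.
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(sample)

# Single words plus two-word phrases.
with_bigrams = TfidfVectorizer(ngram_range=(1, 2)).fit(sample)

# Same n-grams, but the vocabulary is capped at three terms.
capped = TfidfVectorizer(ngram_range=(1, 2), max_features=3).fit(sample)

print(sorted(unigrams.vocabulary_))
print(sorted(with_bigrams.vocabulary_))
print(len(capped.vocabulary_))
```

Adding n-grams grows the vocabulary quickly, which is why capping it with max_features keeps the feature matrix manageable on a corpus of this size.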

Split the data into 80 % training and 20 % test sets:

PYTHON
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Defining the Classifiers

Instantiate the three classifiers you will compare. Each one will later be wrapped inside OneVsRestClassifier:

PYTHON
sgd = SGDClassifier()
lr = LogisticRegression(solver='lbfgs')
svc = LinearSVC()

Metrics for Multi-Label Classification

Exact-match accuracy (also called subset accuracy) is a harsh measure for multi-label problems: it counts a prediction as correct only when every label matches, yet a prediction is rarely either fully right or fully wrong and is often partially correct. Two metrics that give credit for partial matches are Hamming loss and Jaccard score.

Hamming Loss

Hamming loss — the fraction of labels that are incorrectly predicted, averaged over all samples. It counts every individually wrong label slot. A lower value is better; the perfect score is 0.

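$$\text{Hamming loss} = \frac{1}{N L} \sum_{i=1}^{N} \sum_{j=1}^{L} \mathbb{1}(\hat{y}_{ij} \neq y_{ij})$$

Where:

  • $N$: the number of samples
  • $L$: the total number of possible labels
  • $\hat{y}_{ij}$: the predicted label for sample $i$, label position $j$
  • $y_{ij}$: the true label for sample $i$, label position $j$
  • $\mathbb{1}(\cdot)$: the indicator function, equal to 1 when the condition inside is true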
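As a quick check of the definition, the sketch below computes Hamming loss by hand on a made-up example with three samples and four labels, then compares it with scikit-learn's hamming_loss:

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Made-up true and predicted label matrices (3 samples, 4 labels).
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],    # one missed label
                   [0, 1, 0, 0],    # exact match
                   [1, 0, 1, 0]])   # one missed label, one wrong label

# Fraction of the 3 * 4 = 12 label slots that disagree.
manual = (y_true != y_pred).mean()
library = hamming_loss(y_true, y_pred)

print(manual, library)
```

Three of the twelve slots are wrong, so both values come out at 0.25.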

Jaccard Score

Jaccard score (also called the Jaccard index or Intersection over Union) — the size of the intersection between the predicted labels and the true labels divided by the size of their union. It ranges from 0 to 1, and 1 is the perfect score.

$$J(Y, \hat{Y}) = \frac{|Y \cap \hat{Y}|}{|Y \cup \hat{Y}|}$$

Where:

  • $Y$: the set of true labels for a sample
  • $\hat{Y}$: the set of predicted labels for the same sample
  • $|Y \cap \hat{Y}|$: the number of labels that are both true and predicted (true positives)
  • $|Y \cup \hat{Y}|$: the number of unique labels that appear in either set (true positives + false positives + false negatives)
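The set view of the formula can be sketched directly with Python sets. The label names and the two matrices below are made up for illustration; the last line compares the hand-computed mean with scikit-learn's jaccard_score using average='samples':

```python
import numpy as np
from sklearn.metrics import jaccard_score

labels = ["c#", "java", "python", "sql"]   # hypothetical label order

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 0]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 1, 0]])


def to_set(row):
    # Names of the labels whose column holds a 1.
    return {labels[j] for j in np.flatnonzero(row)}


per_sample = []
for true_row, pred_row in zip(y_true, y_pred):
    true_set, pred_set = to_set(true_row), to_set(pred_row)
    per_sample.append(len(true_set & pred_set) / len(true_set | pred_set))

manual = sum(per_sample) / len(per_sample)
library = jaccard_score(y_true, y_pred, average="samples")

print(per_sample, manual, library)
```

The per-sample scores are 1/2, 1, and 1/3, so a partially correct prediction earns partial credit instead of counting as a complete miss.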

Define helper functions to compute and print the Jaccard score for any classifier:

PYTHON
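def j_score(y_true, y_pred):
    """Mean per-sample Jaccard score, returned as a percentage (0 to 100)."""
    # On 0/1 indicator arrays, the element-wise minimum marks the intersection
    # and the element-wise maximum marks the union of true and predicted labels.
    intersection = np.minimum(y_true, y_pred).sum(axis=1)
    union = np.maximum(y_true, y_pred).sum(axis=1)
    return (intersection / union).mean() * 100


def print_score(y_pred, clf):
    # Relies on the global y_test created by train_test_split above.
    print("Clf: ", clf.__class__.__name__)
    print('Jaccard score: {}'.format(j_score(y_test, y_pred)))
    print('----')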
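One edge case is worth knowing about: the helper divides by the size of the union, which is zero when a sample has no true labels and the model predicts none. That may never happen in this dataset, but a guarded variant could look like the sketch below. The name j_score_safe and the empty_value parameter are hypothetical, and scoring an empty-versus-empty row as a perfect match is an assumption you may want to change:

```python
import numpy as np


def j_score_safe(y_true, y_pred, empty_value=1.0):
    """Mean per-sample Jaccard score in percent, guarding against empty unions."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    intersection = np.minimum(y_true, y_pred).sum(axis=1)
    union = np.maximum(y_true, y_pred).sum(axis=1)
    # Start from empty_value and overwrite only the rows with a non-empty union.
    scores = np.full(union.shape, empty_value, dtype=float)
    np.divide(intersection, union, out=scores, where=union > 0)
    return scores.mean() * 100


y_true = [[1, 0], [0, 0]]
y_pred = [[1, 1], [0, 0]]   # second row: nothing true, nothing predicted

print(j_score_safe(y_true, y_pred))                    # empty row counted as a match
print(j_score_safe(y_true, y_pred, empty_value=0.0))   # empty row counted as a miss
```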

OneVsRest Classifier

The OneVsRest strategy (also called one-vs-all) is the standard way to adapt a binary classifier to multi-class or multi-label problems. It works by training one separate binary classifier per label. Each binary classifier learns to distinguish "is this label present?" from "is it absent?", treating every sample that lacks the label as a negative example, whatever other labels it carries.

For 20 Stack Overflow tags, OneVsRestClassifier trains 20 independent binary models and combines their predictions into a final multi-label output. Any binary classifier — LinearSVC, SGDClassifier, or LogisticRegression — can be passed as the base estimator.
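The decomposition can be sketched on a tiny synthetic problem. Everything below (the random features, the two labels defined by the sign of each feature) is made up for illustration; the point is that OneVsRestClassifier holds one fitted estimator per label and, in this setup, predicts the same thing as fitting each label's binary model by hand:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic data: two features, two labels that can both be on at once.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = np.column_stack([X[:, 0] > 0, X[:, 1] > 0]).astype(int)

ovr = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
print(len(ovr.estimators_))   # one binary model per label column

# The same idea written out: fit one binary classifier per column of Y.
manual = np.column_stack([
    LogisticRegression().fit(X, Y[:, j]).predict(X)
    for j in range(Y.shape[1])
])

print(np.array_equal(ovr.predict(X), manual))
```

Because each label gets its own independent model, the wrapper cannot learn that certain tags tend to occur together; that is the price of its simplicity.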

First, train a tuned LinearSVC alone to establish a baseline:

PYTHON
for classifier in [LinearSVC(C=1.5, penalty = 'l1', dual=False)]:
  clf = OneVsRestClassifier(classifier)
  clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  print_score(y_pred, classifier)

Then compare all three default classifiers in a loop:

PYTHON
for classifier in [sgd, lr, svc]:
  clf = OneVsRestClassifier(classifier)
  clf.fit(X_train, y_train)
  y_pred = clf.predict(X_test)
  print_score(y_pred, classifier)

The classifier with the highest Jaccard score is the best choice for this dataset.

Testing the Model on Real Data

To verify that the model works in practice, pass a new, unseen question as a plain string:

PYTHON
x = [ 'how to write ml code in python and java i have data but do not know how to do it']

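Transform the text with the same TF-IDF vectorizer used during training, then predict. Here clf is the last model fitted in the comparison loop, that is, the OneVsRestClassifier wrapping the default LinearSVC: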

PYTHON
xt = tfidf.transform(x)
clf.predict(xt)

Decode the binary prediction back to human-readable tag names:

PYTHON
multilabel.inverse_transform(clf.predict(xt))

inverse_transform converts the binary output array back into the list of tag strings that the model predicted for this question.
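The three steps (transform, predict, decode) can be bundled into one function. The sketch below is self-contained and trains on a made-up six-question corpus, so its predictions mean little; suggest_tags is a hypothetical helper name, and LogisticRegression stands in for whichever base classifier you selected:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up training data.
texts = [
    "read csv file with pandas in python",
    "python list comprehension syntax",
    "java arraylist vs linkedlist",
    "java null pointer exception in loop",
    "call java code from python script",
    "python pandas dataframe merge",
]
tags = [["python"], ["python"], ["java"], ["java"], ["java", "python"], ["python"]]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
model = OneVsRestClassifier(LogisticRegression()).fit(X, Y)


def suggest_tags(questions, vectorizer, model, binarizer):
    """Hypothetical helper: return one tuple of predicted tags per question."""
    features = vectorizer.transform(questions)   # transform only, never refit
    predictions = model.predict(features)        # 0/1 matrix, one column per tag
    return binarizer.inverse_transform(predictions)


suggestions = suggest_tags(["how do i merge two dataframes in python"],
                           vectorizer, model, binarizer)
print(suggestions)
```

A question can come back with an empty tuple when no per-label classifier fires, so a production system would likely need a fallback for that case.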

Conclusion

In this tutorial you built a complete multi-label tag-prediction pipeline for Stack Overflow questions. You parsed tag strings with ast.literal_eval, encoded them into a binary matrix with MultiLabelBinarizer, converted question text into TF-IDF vectors, and trained SGDClassifier, LogisticRegression, and LinearSVC classifiers wrapped in the OneVsRestClassifier strategy. You evaluated each model using Jaccard score, which correctly handles partially correct multi-label predictions.

Key takeaways:

  • Multi-label classification allows a model to assign any number of labels per sample — a standard single-output classifier cannot do this.
  • MultiLabelBinarizer converts a list of tag strings into a binary matrix that scikit-learn models can consume.
  • TF-IDF downweights common words and boosts rare but informative words, making it a strong baseline for text classification.
  • OneVsRestClassifier decomposes any multi-label problem into one binary classifier per label, so you can use any standard binary estimator.
  • Jaccard score suits multi-label evaluation because it gives credit for partially correct predictions, which exact-match accuracy does not.

Next steps:

  • Explore Sentiment Analysis with scikit-learn to apply similar TF-IDF and classification techniques to opinion mining.
  • Read Logistic Regression in Python for a deep dive into one of the base classifiers used here and how its coefficients relate to feature importance.
  • Try SVM with Python to understand the theoretical basis behind LinearSVC and how the regularization hyperparameter C controls the bias-variance trade-off.
