Data Variable Types Every Data Scientist Needs

When embarking on exploratory data analysis, your first task is identifying what types of data you are working with. Different statistical variables require entirely different preprocessing, visualization, and modeling strategies. Misidentifying a variable type can lead to inappropriate charts, broken models, or lost information.

A variable is any characteristic, quantity, or number that can be measured or counted. In programming and statistics, they are named variables because their values vary across different observations.

In this tutorial, you will explore the four major types of variables: numerical, categorical, date-time, and mixed variables. You will use two real-world finance datasets containing customer IDs, loan amounts, interest rates, and employment categories to learn how to identify, inspect, and visualize each data type using histograms and bar charts.

Prerequisites: Python 3.x, Pandas, NumPy, Matplotlib.

What Is a Variable?

A variable is any characteristic, number, or quantity that can be measured or counted. They are called 'variables' because the value they take may vary, and it usually does. In programming, a variable is a value that can change, depending on conditions or on information passed to the program.

The following are examples of variables:

Age (21, 35, 62, ...)
Gender (male, female)
Income (GBP 20000, GBP 35000, GBP 45000, ...)
House price (GBP 350000, GBP 570000, ...)
Country of birth (China, Russia, Costa Rica, ...)
Eye colour (brown, green, blue, ...)
Vehicle make (Ford, Volkswagen, ...)

Most variables in a data set can be classified into one of four major types:

Numerical variables
Categorical variables
Date-Time Variables
Mixed Variables

Numeric Variables

The values of a numerical variable are numbers.

They can be further classified into:

Discrete variables
Continuous variables

Discrete Variable

In a discrete variable, the values are whole numbers (counts). For example, the number of items bought by a customer in a supermarket is discrete. The customer can buy 1, 25, or 50 items, but not 3.7 items. It is always a round number.

The following are examples of discrete variables:

Number of active bank accounts of a borrower (1, 4, 7, ...)
Number of pets in the family
Number of children in the family

Continuous Variable

A variable that may contain any value within a range is continuous. For example, the total amount paid by a customer in a supermarket is continuous. The customer can pay, GBP 20.5, GBP 13.10, GBP 83.20 and so on.

Other examples of continuous variables are:

House price (in principle, it can take any value) (GBP 350000, 57000, 100000, ...)
Time spent surfing a website (3.4 seconds, 5.10 seconds, ...)
Total debt as percentage of total income in the last month (0.2, 0.001, 0, 0.75, ...)

Demo Dataset

In this demo, we will use a toy data set which simulates data from a peer-to-peer finance company to inspect numerical and categorical variables.

customer_id: Unique ID for each customer
disbursed_amount: loan amount given to the borrower
interest: interest rate
income: annual income
number_open_accounts: open accounts (more on this later)
number_credit_lines_12: accounts opened in the last 12 months
target: loan status(paid or being repaid = 1, defaulted = 0)
loan_purpose: intended use of the loan
market: the risk market assigned to the borrower (based in their financial situation)
householder: whether the borrower owns or rents their property
date_issued: date the loan was issued
date_last_payment: date of last payment towards repyaing the loan

Setup

Before loading and analyzing the dataset, import the standard data processing and plotting libraries:

PYTHON

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Load the peer-to-peer lending CSV file directly from the repository into a Pandas DataFrame:

PYTHON

data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/feature-engineering-for-machine-learning-dataset/master/loan.csv')
data.head()

OUTPUT

	customer_id	disbursed_amount	interest	market	employment	time_employed	householder	income	date_issued	loan_purpose	number_open_accounts	date_last_payment	number_credit_lines_12
0	0	23201.5	15.4840	C	Teacher	<=5 years	RENT	84600.0	2013-06-11	Debt consolidation	4.0	2016-01-14	NaN
1	1	7425.0	11.2032	B	Accountant	<=5 years	OWNER	102000.0	2014-05-08	Car purchase	13.0	2016-01-25	NaN
2	2	11150.0	8.5100	A	Statistician	<=5 years	RENT	69840.0	2013-10-26	Debt consolidation	8.0	2014-09-26	NaN
3	3	7600.0	5.8656	A	Other	<=5 years	RENT	100386.0	2015-08-20	Debt consolidation	20.0	2016-01-26	NaN
4	4	31960.0	18.7392	E	Bus driver	>5 years	RENT	95040.0	2014-07-22	Debt consolidation	14.0	2016-01-11	NaN

Continuous Variables

Let's look at the values of the variable disbursed_amount. It is the amount of money requested by the borrower. This variable is continuous and can take any principal value.

Check all the unique values present in the column:

PYTHON

data['disbursed_amount'].unique()

OUTPUT

array([23201.5 ,  7425.  , 11150.  , ...,  6279.  , 12894.75, 25584.  ])

To construct a histogram from a continuous variable, you first split the data range into equal-width intervals called bins. The histogram below plots the distribution of requested loan amounts across 50 bins:

PYTHON

fig = data['disbursed_amount'].hist(bins=50)

# title and axis labels
fig.set_title('Loan Amount Requested')
fig.set_xlabel('Loan Amount')
fig.set_ylabel('Number of Loans')

The resulting plot shows a continuous distribution of values across the entire range with no gaps between adjacent bars:

Histogram showing the distribution of requested loan amounts

Similarly, the interest column contains the continuous interest rates charged to the borrowers:

PYTHON

data['interest'].unique()

OUTPUT

array([15.484 , 11.2032,  8.51  , ..., 12.9195, 11.2332, 11.0019])

Plot the interest rates using 30 histogram bins:

PYTHON

fig = data['interest'].hist(bins=30)
fig.set_title('Interest Rate')
fig.set_xlabel('Interest Rate')
fig.set_ylabel('Number of Loans')

The interest rates also vary continuously, showing a skewed distribution without gaps:

Histogram of interest rate distribution

Now, let's explore the income declared by the customers, that is, how much they earn yearly. This variable is also continuous.

Check the unique values of annual income declared by borrowers:

PYTHON

data['income'].unique()

OUTPUT

array([ 84600.  , 102000.  ,  69840.  , ..., 228420.  , 133950.  ,
        55941.84])

Plot the income distribution using 100 bins, limiting the x-axis to $400,000 for better readability:

PYTHON

fig = data['income'].hist(bins=100)

# For better visualisation, only specific range is displayed on the x-axis
fig.set_xlim(0, 400000)

# title and axis labels
fig.set_title("Customer's Annual Income")
fig.set_xlabel('Annual Income')
fig.set_ylabel('Number of Customers')

Most borrowers report annual incomes between $30, 000 an d$ 70,000, with a long right tail of higher earners:

Histogram of customer annual income distribution

Discrete Variables

Next, inspect the number of open credit lines in the borrower's credit file (number_open_accounts). Because a customer can only open integer numbers of accounts, this is a discrete variable:

PYTHON

data['number_open_accounts'].unique()

OUTPUT

array([ 4., 13.,  8., 20., 14.,  5.,  9., 18., 16., 17., 12., 15.,  6.,
       10., 11.,  7., 21., 19., 26.,  2., 22., 27., 23., 25., 24., 28.,
        3., 30., 41., 32., 33., 31., 29., 37., 49., 34., 35., 38.,  1.,
       36., 42., 47., 40., 44., 43.])

Construct a histogram of open credit accounts using 100 bins and limiting the range to 30:

PYTHON

# let's plot a histogram to get familiar with the distribution of the variable

fig = data['number_open_accounts'].hist(bins=100)

# for better visualisation, only specific range in the x-axis is displayed
fig.set_xlim(0, 30)

# title and axis labels
fig.set_title('Number of open accounts')
fig.set_xlabel('Number of open accounts')
fig.set_ylabel('Number of Customers')

Unlike continuous distributions, the discrete nature of accounts results in a plot with visible gaps between distinct integer bars:

Histogram showing discrete counts of open credit accounts

Similarly, inspect the number of installment accounts opened in the past 12 months (number_credit_lines_12):

PYTHON

# let's inspect the variable values

data['number_credit_lines_12'].unique()

OUTPUT

array([nan,  2.,  4.,  1.,  0.,  3.,  5.,  6.])

Plot the number of installment accounts using 50 histogram bins:

PYTHON

# let's make a histogram to get familiar with the distribution of the variable

fig = data['number_credit_lines_12'].hist(bins=50)

# title and axis labels
fig.set_title('Number of installment accounts opened in past 12 months')
fig.set_xlabel('Number of installment accounts opened in past 12 months')
fig.set_ylabel('Number of Borrowers')

The plot clearly shows gaps representing integer gaps, with most customers having zero to two accounts:

Histogram of discrete installment accounts opened in the past year

Binary Variables

A binary variable is a special case of discrete variables that can only take two values, such as 0 or 1. Inspect the target column, which indicates whether a loan defaulted (0) or was repaid (1):

PYTHON

data['target'].unique()

OUTPUT

array([0, 1], dtype=int64)

Plot the distribution of the binary target variable:

PYTHON

# let's make a histogram, although histograms for binary variables do not make a lot of sense

fig = data['target'].hist()

#only specific range in the x-axis is displayed
fig.set_xlim(0, 2)

# title and axis labels
fig.set_title('Defaulted accounts')
fig.set_xlabel('Defaulted')
fig.set_ylabel('Number of Loans')

The histogram shows two bars corresponding to the binary classes 0 and 1:

Histogram of binary loan default target distribution

Categorical Variables

A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values. The values of a categorical variable are selected from a group of categories, also called labels. Examples are gender (male or female) and marital status (never married, married, divorced or widowed).

Other examples of categorical variables include:

Intended use of loan (debt-consolidation, car purchase, wedding expenses, ...)
Mobile network provider (Vodafone, Orange, ...)
Postcode

Categorical variables can be further categorised into:

Ordinal Variables
Nominal variables

Ordinal Variable

Ordinal variables are categorical variable in which the categories can be meaningfully ordered.

For example:

Student's grade in an exam (A, B, C or Fail).
Days of the week, where Monday = 1 and Sunday = 7.
Educational level, with the categories Elementary school, High school, College graduate and PhD ranked from 1 to 4.

Nominal Variable

For nominal variables, there isn't an intrinsic order in the labels. For example, country of birth, with values Argentina, England, Germany, etc., is nominal.

Other examples of nominal variables include:

Car colour (blue, grey, silver, ...)
Vehicle make (Citroen, Peugeot, ...)
City (Manchester, London, Chester, ...)

There is nothing that indicates an intrinsic order of the labels, and in principle, they are all equal.

Note:-

Sometimes categorical variables are coded as numbers when the data are recorded (e.g. gender may be coded as 0 for males and 1 for females). The variable is still categorical, despite the use of numbers.

In a similar way, individuals in a survey may be coded with a number that uniquely identifies them (maybe to avoid storing personal information for confidentiality). This number is really a label, and the variable is categorical. The number has no meaning other than making it possible to uniquely identify the observation (in this case the interviewed subject).

Ideally, when we work with a dataset in a business scenario, the data will come with a dictionary that indicates if the numbers in the variables are to be considered as categories or if they are numerical. And if the numbers are categories, the dictionary would explain what each value in the variable represents.

Check the unique categories in the householder column, which indicates home ownership status:

PYTHON

data['householder'].unique()

OUTPUT

array(['RENT', 'OWNER', 'MORTGAGE'], dtype=object)

Count the number of customers in each home ownership category and generate a bar plot:

PYTHON

fig = data['householder'].value_counts().plot.bar()
fig.set_title('Householder')
fig.set_ylabel('Number of customers')

PYTHON

Text(0, 0.5, 'Number of customers')

The bar plot displays customer frequencies across Rent, Owner, and Mortgage categories:

Bar plot showing home ownership categories of borrowers

You can verify the exact counts for each category using the value_counts() method:

PYTHON

data['householder'].value_counts()

OUTPUT

MORTGAGE    4957
RENT        4055
OWNER        988
Name: householder, dtype: int64

Check the unique values of loan_purpose, another nominal variable representing how the borrower intends to use the funds:

PYTHON

data['loan_purpose'].unique()

OUTPUT

array(['Debt consolidation', 'Car purchase', 'Other', 'Home improvements',
       'Moving home', 'Health', 'Holidays', 'Wedding'], dtype=object)

Create a bar plot showing the counts of loans by their purpose:

PYTHON

fig = data['loan_purpose'].value_counts().plot.bar()
fig.set_title('Loan Purpose')
fig.set_ylabel('Number of customers')

PYTHON

Text(0, 0.5, 'Number of customers')

The bar plot shows that debt consolidation is by far the most common loan purpose:

Bar plot showing loan purpose categories

Now examine the market variable, which represents the risk rating band. This is an ordinal variable because risk bands have a logical progression:

PYTHON

data['market'].unique()

OUTPUT

array(['C', 'B', 'A', 'E', 'D'], dtype=object)

Plot the distribution of borrowers across different risk markets:

PYTHON

fig = data['market'].value_counts().plot.bar()
fig.set_title('Status of the Loan')
fig.set_ylabel('Number of customers')

PYTHON

Text(0, 0.5, 'Number of customers')

The bar plot displays customer frequencies across risk categories A through E:

Bar plot showing risk market categories

Finally, inspect the first five values of customer_id. Although these are numbers, they serve as unique identifiers (labels) and are functionally categorical:

PYTHON

data['customer_id'].head()

OUTPUT

0    0
1    1
2    2
3    3
4    4
Name: customer_id, dtype: int64

Count the number of unique customer IDs:

PYTHON

len(data['customer_id'].unique())

OUTPUT

Date and Time Variables

A special type of categorical variable are those that instead of taking traditional labels, like color (blue, red), or city (London, Manchester), take dates and / or time as values. For example, date of birth ('29-08-1987', '12-01-2012'), or date of application ('2016-Dec', '2013-March').

Datetime variables can contain dates only, time only, or date and time.

We don't usually work with a datetime variable in their raw format because:

Date variables contain a huge number of different categories
We can extract much more information from datetime variables by preprocessing them correctly

In addition, often, date variables will contain dates that were not present in the dataset used to train the machine learning model. In fact, date variables will usually contains dates from the future, respect to the dates in the training dataset. Therefore, the machine learning model will not know what to do with them, because it never saw them while being trained.

Pandas initially reads dates as strings (object type):

PYTHON

data[['date_issued', 'date_last_payment']].dtypes

OUTPUT

date_issued          object
date_last_payment    object
dtype: object

Convert the raw string columns into datetime objects using to_datetime():

PYTHON

data['date_issued_dt'] = pd.to_datetime(data['date_issued'])
data['date_last_payment_dt'] = pd.to_datetime(data['date_last_payment'])
data[['date_issued', 'date_issued_dt', 'date_last_payment', 'date_last_payment_dt']].head()

OUTPUT

	date_issued	date_issued_dt	date_last_payment	date_last_payment_dt
0	2013-06-11	2013-06-11	2016-01-14	2016-01-14
1	2014-05-08	2014-05-08	2016-01-25	2016-01-25
2	2013-10-26	2013-10-26	2014-09-26	2014-09-26
3	2015-08-20	2015-08-20	2016-01-26	2016-01-26
4	2014-07-22	2014-07-22	2016-01-11	2016-01-11

Verify that the columns have been cast to the datetime type:

PYTHON

data[['date_issued_dt', 'date_last_payment_dt']].dtypes

OUTPUT

date_issued_dt          datetime64[ns]
date_last_payment_dt    datetime64[ns]
dtype: object

Extract the month and year from the parsed datetime columns:

PYTHON

data['month'] = data['date_issued_dt'].dt.month
data['year'] = data['date_issued_dt'].dt.year

Calculate the sum of disbursed amounts grouped by year, month, and risk market, and plot the trends over time:

PYTHON

fig = data.groupby(['year','month', 'market'])['disbursed_amount'].sum().unstack().plot(
    figsize=(14, 8), linewidth=2)

fig.set_title('Disbursed amount in time')
fig.set_ylabel('Disbursed Amount')

The time-series plot shows lending amounts growing over time, driven by grades B and C:

Line plot showing total disbursed loan amount over time grouped by risk market

This toy finance company seems to have increased the amount of money lent from 2012 onwards. The tendency indicates that they continue to grow. In addition, we can see that their major business comes from lending money to C and B grades.

A grades are the lower risk borrowers, borrowers that most likely will be able to repay their loans, as they are typically in a better financial situation. Borrowers within this grade are charged lower interest rates.

D and E grades represent the riskier borrowers. Usually borrowers in somewhat tighter financial situations, or for whom there is not sufficient financial history to make a reliable credit assessment. They are typically charged higher rates, as the business, and therefore the investors, take a higher risk when lending them money.

Mixed Variables

Mixed variables are those which values contain both numbers and labels.

Variables can be mixed for a variety of reasons. For example, when credit agencies gather and store financial information of users, usually, the values of the variables they store are numbers. However, in some cases the credit agencies cannot retrieve information for a certain user for different reasons. What Credit Agencies do in these situations is to code each different reason due to which they failed to retrieve information with a different code or 'label'. While doing so they generate mixed type variables. These variables contain numbers when the value could be retrieved, or labels otherwise.

As an example, think of the variable number_of_open_accounts. It can take any number, representing the number of different financial accounts of the borrower. Sometimes, information may not be available for a certain borrower, for a variety of reasons. Each reason will be coded by a different letter, for example: 'A': couldn't identify the person, 'B': no relevant data, 'C': person seems not to have any open account.

Another example of mixed type variables, is for example the variable missed_payment_status. This variable indicates, whether a borrower has missed a (any) payment in their financial item. For example, if the borrower has a credit card, this variable indicates whether they missed a monthly payment on it. Therefore, this variable can take values of 0, 1, 2, 3 meaning that the customer has missed 0-3 payments in their account. And it can also take the value D, if the customer defaulted on that account. Typically, once the customer has missed 3 payments, the lender declares the item defaulted (D), that is why this variable takes numerical values 0-3 and then D.

Mixed Variables Dataset

The dataset used to study Mixed Variables contains 2 columns id and open_il_24m. id is the unique key for each borrower and open_il_24m is the number of installment accounts opened in past 24 months by that borrower. Installment accounts are those that, at the moment of acquiring them, there is a set period and amount of repayments agreed between the lender and borrower. An example of this is a car loan, or a student loan. The borrowers know that they are going to pay a fixed amount over a fixed period.

We will begin by reading the dataset into a dataframe.

PYTHON

data = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/feature-engineering-for-machine-learning-dataset/master/sample_s2.csv')
data.head()

OUTPUT

	id	open_il_24m
0	1077501	C
1	1077430	A
2	1077175	A
3	1076863	A
4	1075358	A

Check the size of the dataset:

PYTHON

data.shape

OUTPUT

(887379, 2)

Fictitious meaning of the different letters / codes in the variable open_il_24m are:

'A': couldn't identify the person
'B': no relevant data
'C': person seems not to have any account open

Examine the unique values present in the column:

PYTHON

data.open_il_24m.unique()

OUTPUT

array(['C', 'A', 'B', '0.0', '1.0', '2.0', '4.0', '3.0', '6.0', '5.0',
       '9.0', '7.0', '8.0', '13.0', '10.0', '19.0', '11.0', '12.0',
       '14.0', '15.0'], dtype=object)

Display the frequency of each unique mixed value:

PYTHON

data.open_il_24m.value_counts()

OUTPUT

A       500548
C       300000
B        65459
1.0       6436
0.0       5481
2.0       4448
3.0       2468
4.0       1249
5.0        606
6.0        309
7.0        163
8.0         81
9.0         47
10.0        28
11.0        23
12.0        17
13.0         7
14.0         6
15.0         2
19.0         1
Name: open_il_24m, dtype: int64

Generate a bar plot of frequencies for the mixed variable values:

PYTHON

fig = data.open_il_24m.value_counts().plot.bar()
fig.set_title('Number of installment accounts open')
fig.set_ylabel('Number of borrowers')

The resulting bar plot displays both the alphanumeric codes (A, B, C) and the numeric strings:

Bar plot displaying mixed variable value counts

This is how a mixed variable looks like!

Summary

The diagram below provides a visual mindmap summarizing how numerical, categorical, date-time, and mixed variables are structured:

Variables classification mindmap

Conclusion

In this tutorial, you examined the four main categories of data variables: numerical (discrete and continuous), categorical (nominal and ordinal), date-time, and mixed variables. Using real-world finance datasets, you loaded, cleaned, parsed, and visualized columns using Matplotlib histograms and bar plots to understand data structures.

Key takeaways:

Numerical Variables: Numeric data can be continuous (values vary smoothly over a range, like income) or discrete (values represent distinct integer counts, like number of open accounts).
Categorical Variables: Categories can be nominal (having no inherent order, like loan purpose) or ordinal (having a logical sequence, like risk grade bands).
Date-Time Variables: String representation of dates must be parsed into proper datetime formats (datetime64[ns]) before you can perform time-series aggregation or trend analysis.
Mixed Variables: Columns containing both numerical measurements and categorical reason labels (like credit status codes) must be identified and preprocessed carefully before modeling.

Next steps:

Get a complete hands-on introduction to data structures with Pandas Crash Course to master loading, filtering, and mutating data tables.
Learn how to visualize relationships between variables directly from DataFrames in Data Visualization with Pandas.
Master plotting basics with Matplotlib Crash Course to style and customize your plots.

Data Variable Types Every Data Scientist Needs

Topics You Will Master

What Is a Variable?

Numeric Variables

Discrete Variable

Continuous Variable

Demo Dataset

Setup

Continuous Variables

Discrete Variables

Binary Variables

Categorical Variables

Ordinal Variable

Nominal Variable

Date and Time Variables

Mixed Variables

Mixed Variables Dataset

Summary

Conclusion

Latest recommendations you might like

LinkedIn Auto Connect Bot

Dimensionality Reduction with LDA and PCA in Python

Find this tutorial useful?

Discussion & Comments