Pandas Crash Course

When working on a data science project, you spend most of your time cleaning and preprocessing data. Choosing the right library to manipulate tabular data efficiently is crucial for a smooth workflow.

Pandas is the standard open-source library for data analysis and manipulation in Python. Built on top of NumPy, it introduces the DataFrame—a highly optimized two-dimensional tabular data structure with labeled axes (rows and columns) that makes data manipulation fast and intuitive.

In this tutorial, you will build a complete data cleaning and analysis pipeline. You will learn to construct DataFrames from scratch, import real-world datasets like NBA player stats and IMDB movie rankings, handle duplicate rows, impute missing values using statistical means, and analyze variable relationships with correlation.

Prerequisites: Python 3.x, Pandas, NumPy, Seaborn, Matplotlib.

Datasets

You can download all the datasets used in this tutorial from the All CSV ML Data Files Download repository.

Getting Started

To begin working with the library, you must first import the module under its standard alias:

PYTHON

import pandas as pd

Prepare a Python dictionary containing lists of values to represent orange and apple counts:

PYTHON

data = {
    'apple': [3,1,4,5],
    'orange': [1, 5, 6, 8]
}

data

OUTPUT

{'apple': [3, 1, 4, 5], 'orange': [1, 5, 6, 8]}

Check the data type of your newly created dictionary:

PYTHON

type(data)

OUTPUT

dict

Convert your Python dictionary into a Pandas DataFrame. A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure. It aligns data in a tabular fashion using rows and columns:

PYTHON

df = pd.DataFrame(data)
df

OUTPUT

	apple	orange
0	3	1
1	1	5
2	4	6
3	5	8

Extract a single column to see how Pandas represents individual columns as Series. A Series is a one-dimensional array-like object that contains a sequence of values:

PYTHON

df['apple']

OUTPUT

0    3
1    1
2    4
3    5
Name: apple, dtype: int64

Verify that the type of the extracted column is indeed a Pandas Series:

PYTHON

type(df['apple'])

PYTHON

pandas.core.series.Series

Reading CSV Files

To load external datasets, Pandas provides robust input/output functions. For example, read_csv() imports a comma-separated values file directly into a DataFrame:

PYTHON

df = pd.read_csv('nba.csv')

Display the first ten rows of the DataFrame using head() to examine the dataset structure:

PYTHON

df.head(10)

OUTPUT

	Name	Team	Number	Position	Age	Height	Weight	College	Salary
0	Avery Bradley	Boston Celtics	0.0	PG	25.0	6-2	180.0	Texas	7730337.0
1	Jae Crowder	Boston Celtics	99.0	SF	25.0	6-6	235.0	Marquette	6796117.0
2	John Holland	Boston Celtics	30.0	SG	27.0	6-5	205.0	Boston University	NaN
3	R.J. Hunter	Boston Celtics	28.0	SG	22.0	6-5	185.0	Georgia State	1148640.0
4	Jonas Jerebko	Boston Celtics	8.0	PF	29.0	6-10	231.0	NaN	5000000.0
5	Amir Johnson	Boston Celtics	90.0	PF	29.0	6-9	240.0	NaN	12000000.0
6	Jordan Mickey	Boston Celtics	55.0	PF	21.0	6-8	235.0	LSU	1170960.0
7	Kelly Olynyk	Boston Celtics	41.0	C	25.0	7-0	238.0	Gonzaga	2165160.0
8	Terry Rozier	Boston Celtics	12.0	PG	22.0	6-2	190.0	Louisville	1824360.0
9	Marcus Smart	Boston Celtics	36.0	PG	22.0	6-4	220.0	Oklahoma State	3431040.0

View the last two rows of the dataset using tail() to verify row alignments at the end of the file:

PYTHON

df.tail(2)

OUTPUT

	Name	Team	Number	Position	Age	Height	Weight	College	Salary
456	Jeff Withey	Utah Jazz	24.0	C	26.0	7-0	231.0	Kansas	947276.0
457	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Load the CSV file while specifying the Name column to act as the row labels or index of the DataFrame:

PYTHON

df = pd.read_csv('nba.csv', index_col = 'Name')
df.head()

OUTPUT

Name	Team	Number	Position	Age	Height	Weight	College	Salary
Avery Bradley	Boston Celtics	0.0	PG	25.0	6-2	180.0	Texas	7730337.0
Jae Crowder	Boston Celtics	99.0	SF	25.0	6-6	235.0	Marquette	6796117.0
John Holland	Boston Celtics	30.0	SG	27.0	6-5	205.0	Boston University	NaN
R.J. Hunter	Boston Celtics	28.0	SG	22.0	6-5	185.0	Georgia State	1148640.0
Jonas Jerebko	Boston Celtics	8.0	PF	29.0	6-10	231.0	NaN	5000000.0

You can also use custom row labels for other datasets. For example, load the IMDB Movie dataset using the movie's rank as the index:

PYTHON

df = pd.read_csv('IMDB-Movie-Data.csv', index_col = 'Rank')
df.head()

OUTPUT

Rank	Title	Genre	Description	Director	Actors	Year	Runtime (Minutes)	Rating	Votes	Revenue (Millions)	Metascore
1	Guardians of the Galaxy	Action,Adventure,Sci-Fi	A group of intergalactic criminals are forced ...	James Gunn	Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...	2014	121	8.1	757074	333.13	76.0
2	Prometheus	Adventure,Mystery,Sci-Fi	Following clues to the origin of mankind, a te...	Ridley Scott	Noomi Rapace, Logan Marshall-Green, Michael Fa...	2012	124	7.0	485820	126.46	65.0
3	Split	Horror,Thriller	Three girls are kidnapped by a man with a diag...	M. Night Shyamalan	James McAvoy, Anya Taylor-Joy, Haley Lu Richar...	2016	117	7.3	157606	138.12	62.0
4	Sing	Animation,Comedy,Family	In a city of humanoid animals, a hustling thea...	Christophe Lourdelet	Matthew McConaughey,Reese Witherspoon, Seth Ma...	2016	108	7.2	60545	270.32	59.0
5	Suicide Squad	Action,Adventure,Fantasy	A secret government agency recruits some of th...	David Ayer	Will Smith, Jared Leto, Margot Robbie, Viola D...	2016	123	6.2	393727	325.02	40.0

Display the last five rows of the IMDB dataset to verify the tail structure:

PYTHON

df.tail()

OUTPUT

Rank	Title	Genre	Description	Director	Actors	Year	Runtime (Minutes)	Rating	Votes	Revenue (Millions)	Metascore
996	Secret in Their Eyes	Crime,Drama,Mystery	A tight-knit team of rising investigators, alo...	Billy Ray	Chiwetel Ejiofor, Nicole Kidman, Julia Roberts...	2015	111	6.2	27585	NaN	45.0
997	Hostel: Part II	Horror	Three American college students studying abroa...	Eli Roth	Lauren German, Heather Matarazzo, Bijou Philli...	2007	94	5.5	73152	17.54	46.0
998	Step Up 2: The Streets	Drama,Music,Romance	Romantic sparks occur between two dance studen...	Jon M. Chu	Robert Hoffman, Briana Evigan, Cassie Ventura,...	2008	98	6.2	70699	58.01	50.0
999	Search Party	Adventure,Comedy	A pair of friends embark on a mission to reuni...	Scot Armstrong	Adam Pally, T.J. Miller, Thomas Middleditch,Sh...	2014	93	5.6	4881	NaN	22.0
1000	Nine Lives	Comedy,Family,Fantasy	A stuffy businessman finds himself trapped ins...	Barry Sonnenfeld	Kevin Spacey, Jennifer Garner, Robbie Amell,Ch...	2016	87	5.3	12435	19.64	11.0

To inspect the schema, data types, and non-null counts of each column, call the info() method:

PYTHON

df.info()

PYTHON

Int64Index: 1000 entries, 1 to 1000
Data columns (total 11 columns):
Title                 1000 non-null object
Genre                 1000 non-null object
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null int64
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: float64(3), int64(3), object(5)
memory usage: 93.8+ KB

To check the dimensionality of the DataFrame, inspect the shape attribute:

PYTHON

df.shape

OUTPUT

(1000, 11)

You can count duplicate rows by combining the duplicated() method with a sum operation:

PYTHON

sum(df.duplicated())

OUTPUT

To demonstrate duplicate handling, append the DataFrame to itself to double the number of rows:

PYTHON

df1 = df.append(df)
df1.shape

OUTPUT

(2000, 11)

Verify the number of duplicate rows in the newly concatenated DataFrame:

PYTHON

df1.duplicated().sum()

OUTPUT

Remove the duplicate rows using drop_duplicates() to return a cleaned copy:

PYTHON

df2 = df1.drop_duplicates()
df2.shape

OUTPUT

(1000, 11)

Note that the original DataFrame remains unchanged unless you assign the result back or modify it in place:

PYTHON

df1.shape

OUTPUT

(2000, 11)

To modify the DataFrame directly without creating a new copy, set the inplace parameter to True:

PYTHON

df1.drop_duplicates(inplace = True)
df1.shape

OUTPUT

(1000, 11)

Column Operations

Access the column labels of the DataFrame using the columns attribute:

PYTHON

df.columns

OUTPUT

Index(['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

Count the number of columns in the DataFrame:

PYTHON

len(df.columns)

OUTPUT

Generate descriptive statistics for the numerical columns using the describe() method:

PYTHON

df.describe()

OUTPUT

	Year	Runtime (Minutes)	Rating	Votes	Revenue (Millions)	Metascore
count	1000.000000	1000.000000	1000.000000	1.000000e+03	872.000000	936.000000
mean	2012.783000	113.172000	6.723200	1.698083e+05	82.956376	58.985043
std	3.205962	18.810908	0.945429	1.887626e+05	103.253540	17.194757
min	2006.000000	66.000000	1.900000	6.100000e+01	0.000000	11.000000
25%	2010.000000	100.000000	6.200000	3.630900e+04	13.270000	47.000000
50%	2014.000000	111.000000	6.800000	1.107990e+05	47.985000	59.500000
75%	2016.000000	123.000000	7.400000	2.399098e+05	113.715000	72.000000
max	2016.000000	191.000000	9.000000	1.791916e+06	936.630000	100.000000

Extract the column names and convert them into a standard Python list:

PYTHON

col = df.columns
type(list(col))

OUTPUT

list

Display the index object containing all column names:

PLAINTEXT

col

PLAINTEXT

Index(['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

You can rename columns by assigning a list of new names directly to the columns attribute:

PYTHON

col1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
df.columns = col1
df.head()

OUTPUT

Rank	a	b	c	d	e	f	g	h	i	j	k
1	Guardians of the Galaxy	Action,Adventure,Sci-Fi	A group of intergalactic criminals are forced ...	James Gunn	Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...	2014	121	8.1	757074	333.13	76.0
2	Prometheus	Adventure,Mystery,Sci-Fi	Following clues to the origin of mankind, a te...	Ridley Scott	Noomi Rapace, Logan Marshall-Green, Michael Fa...	2012	124	7.0	485820	126.46	65.0
3	Split	Horror,Thriller	Three girls are kidnapped by a man with a diag...	M. Night Shyamalan	James McAvoy, Anya Taylor-Joy, Haley Lu Richar...	2016	117	7.3	157606	138.12	62.0
4	Sing	Animation,Comedy,Family	In a city of humanoid animals, a hustling thea...	Christophe Lourdelet	Matthew McConaughey,Reese Witherspoon, Seth Ma...	2016	108	7.2	60545	270.32	59.0
5	Suicide Squad	Action,Adventure,Fantasy	A secret government agency recruits some of th...	David Ayer	Will Smith, Jared Leto, Margot Robbie, Viola D...	2016	123	6.2	393727	325.02	40.0

Revert back to the original column names:

PYTHON

df.columns = col
df.head(0)

OUTPUT

Rank	Title	Genre	Description	Director	Actors	Year	Runtime (Minutes)	Rating	Votes	Revenue (Millions)	Metascore

For more precise renaming, use the rename() method with a dictionary mapping old column names to new ones:

PYTHON

df.rename(columns={
    'Runtime (Minutes)': 'Runtime',
    'Revenue (Millions)': 'Revenue'
}, inplace= True)

df.columns

OUTPUT

Index(['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime', 'Rating', 'Votes', 'Revenue', 'Metascore'],
      dtype='object')

Compare the updated column list with the original stored in the col variable:

PLAINTEXT

col

PLAINTEXT

Index(['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

Missing Values

Real-world datasets often contain missing values. You can represent these missing entries in Python using NumPy's nan constant:

PYTHON

import numpy as np

Access the standard representation for empty values:

PYTHON

np.nan

OUTPUT

nan

Identify missing values in the DataFrame using isnull(), which returns boolean values indicating the presence of null values:

PYTHON

df.isnull().head(10)

OUTPUT

Rank	Title	Genre	Description	Director	Actors	Year	Runtime	Rating	Votes	Revenue	Metascore
1	False	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False	False
5	False	False	False	False	False	False	False	False	False	False	False
6	False	False	False	False	False	False	False	False	False	False	False
7	False	False	False	False	False	False	False	False	False	False	False
8	False	False	False	False	False	False	False	False	False	True	False
9	False	False	False	False	False	False	False	False	False	False	False
10	False	False	False	False	False	False	False	False	False	False	False

Sum the boolean values to count the number of missing records in each column:

PYTHON

df.isnull().sum()

OUTPUT

Title            0
Genre            0
Description      0
Director         0
Actors           0
Year             0
Runtime          0
Rating           0
Votes            0
Revenue        128
Metascore       64
dtype: int64

Alternatively, call isna() to achieve the same result:

PYTHON

df.isna().sum()

OUTPUT

Title            0
Genre            0
Description      0
Director         0
Actors           0
Year             0
Runtime          0
Rating           0
Votes            0
Revenue        128
Metascore       64
dtype: int64

Remove any rows containing at least one missing value using dropna():

PYTHON

df1 = df.dropna()
df1.shape

OUTPUT

(838, 11)

To drop columns that contain missing values instead of rows, set axis=1:

PYTHON

df2 = df.dropna(axis = 1)
df2.head(3)

OUTPUT

Rank	Title	Genre	Description	Director	Actors	Year	Runtime	Rating	Votes
1	Guardians of the Galaxy	Action,Adventure,Sci-Fi	A group of intergalactic criminals are forced ...	James Gunn	Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...	2014	121	8.1	757074
2	Prometheus	Adventure,Mystery,Sci-Fi	Following clues to the origin of mankind, a te...	Ridley Scott	Noomi Rapace, Logan Marshall-Green, Michael Fa...	2012	124	7.0	485820
3	Split	Horror,Thriller	Three girls are kidnapped by a man with a diag...	M. Night Shyamalan	James McAvoy, Anya Taylor-Joy, Haley Lu Richar...	2016	117	7.3	157606

Imputation

Rather than deleting rows or columns, you can fill missing cells with a specific value. Fill all missing entries with 0 using fillna():

PYTHON

df3 = df.fillna(0)
df3.isna().sum()

OUTPUT

Title          0
Genre          0
Description    0
Director       0
Actors         0
Year           0
Runtime        0
Rating         0
Votes          0
Revenue        0
Metascore      0
dtype: int64

Recall the counts of null values in the original DataFrame:

PYTHON

df.isnull().sum()

OUTPUT

Title            0
Genre            0
Description      0
Director         0
Actors           0
Year             0
Runtime          0
Rating           0
Votes            0
Revenue        128
Metascore       64
dtype: int64

Extract the Revenue column as a Pandas Series:

PYTHON

revenue = df['Revenue']
type(revenue)

PYTHON

pandas.core.series.Series

Display the last five elements of the revenue Series to check for missing entries:

PYTHON

revenue.tail()

OUTPUT

Rank
996       NaN
997     17.54
998     58.01
999       NaN
1000    19.64
Name: Revenue, dtype: float64

Compute the average revenue to use for imputation:

PYTHON

revenue_mean = revenue.mean()
revenue_mean

OUTPUT

82.95637614678897

Fill the missing entries in revenue with its calculated mean value in-place:

PYTHON

revenue.fillna(revenue_mean, inplace= True)
revenue.tail()

OUTPUT

Rank
996     82.956376
997     17.540000
998     58.010000
999     82.956376
1000    19.640000
Name: Revenue, dtype: float64

Confirm that the revenue Series no longer contains null values:

PYTHON

revenue.isnull().sum()

OUTPUT

Assign the imputed Series back to the DataFrame and check the remaining null counts:

PYTHON

df['Revenue'] = revenue
df.isnull().sum()

OUTPUT

Title           0
Genre           0
Description     0
Director        0
Actors          0
Year            0
Runtime         0
Rating          0
Votes           0
Revenue         0
Metascore      64
dtype: int64

Apply the same mean-imputation process to the Metascore column:

PYTHON

metascore = df['Metascore']
metascore_mean = metascore.mean()
print("metascore_mean =",metascore_mean)
metascore.fillna(metascore_mean, inplace = True)
df['Metascore'] = metascore
df.isnull().sum()

OUTPUT

metascore_mean = 58.98504273504273

Title          0
Genre          0
Description    0
Director       0
Actors         0
Year           0
Runtime        0
Rating         0
Votes          0
Revenue        0
Metascore      0
dtype: int64

Verify the statistical summary of the numerical columns now that the missing values are imputed:

PYTHON

df.describe()

OUTPUT

	Year	Runtime	Rating	Votes	Revenue	Metascore
count	1000.000000	1000.000000	1000.000000	1.000000e+03	1000.000000	1000.000000
mean	2012.783000	113.172000	6.723200	1.698083e+05	82.956376	58.985043
std	3.205962	18.810908	0.945429	1.887626e+05	96.412043	16.634858
min	2006.000000	66.000000	1.900000	6.100000e+01	0.000000	11.000000
25%	2010.000000	100.000000	6.200000	3.630900e+04	17.442500	47.750000
50%	2014.000000	111.000000	6.800000	1.107990e+05	60.375000	58.985043
75%	2016.000000	123.000000	7.400000	2.399098e+05	99.177500	71.000000
max	2016.000000	191.000000	9.000000	1.791916e+06	936.630000	100.000000

Print the updated info schema to confirm all columns have 1000 non-null values:

PYTHON

df.info()

OUTPUT

Int64Index: 1000 entries, 1 to 1000
Data columns (total 11 columns):
Title          1000 non-null object
Genre          1000 non-null object
Description    1000 non-null object
Director       1000 non-null object
Actors         1000 non-null object
Year           1000 non-null int64
Runtime        1000 non-null int64
Rating         1000 non-null float64
Votes          1000 non-null int64
Revenue        1000 non-null float64
Metascore      1000 non-null float64
dtypes: float64(3), int64(3), object(5)
memory usage: 93.8+ KB

Get a statistical summary of the non-numerical Genre column:

PYTHON

df['Genre'].describe()

OUTPUT

count                        1000
unique                        207
top       Action,Adventure,Sci-Fi
freq                           50
Name: Genre, dtype: object

Count the frequencies of the top ten movie genres:

PYTHON

df['Genre'].value_counts().head(10)

OUTPUT

Action,Adventure,Sci-Fi       50
Drama                         48
Comedy,Drama,Romance          35
Comedy                        32
Drama,Romance                 31
Comedy,Drama                  27
Action,Adventure,Fantasy      27
Animation,Adventure,Comedy    27
Comedy,Romance                26
Crime,Drama,Thriller          24
Name: Genre, dtype: int64

Check the count of unique movie genres present in the dataset:

PYTHON

len(df['Genre'].unique())

OUTPUT

Correlation Analysis

Compute the pairwise Pearson correlation matrix for the numerical columns using the corr() method:

PYTHON

corrmat = df.corr()
corrmat

OUTPUT

	Year	Runtime	Rating	Votes	Revenue	Metascore
Year	1.000000	-0.164900	-0.211219	-0.411904	-0.117562	-0.076077
Runtime	-0.164900	1.000000	0.392214	0.407062	0.247834	0.202239
Rating	-0.211219	0.392214	1.000000	0.511537	0.189527	0.604723
Votes	-0.411904	0.407062	0.511537	1.000000	0.607941	0.318116
Revenue	-0.117562	0.247834	0.189527	0.607941	1.000000	0.132304
Metascore	-0.076077	0.202239	0.604723	0.318116	0.132304	1.000000

You can compute relationships between numerical attributes using correlation matrices. To visualize these correlation matrices as colors, import the Seaborn library:

PYTHON

import seaborn as sns

Plot the correlation matrix as a color-encoded heat map:

PYTHON

sns.heatmap(corrmat)

You can also use Matplotlib to generate visualizations directly from DataFrames. Import the Matplotlib pyplot interface:

PYTHON

import matplotlib.pyplot as plt

Generate a scatter plot of movie ratings versus revenues:

PYTHON

df.plot(kind = 'scatter', x = 'Rating', y = 'Revenue', title = 'Revenue vs Rating')

Create a histogram plot to view the frequency distribution of ratings:

PYTHON

df['Rating'].plot(kind = 'hist', title = 'Rating')

Estimate the probability density of ratings using a Kernel Density Estimation (KDE) plot:

PYTHON

df['Rating'].plot(kind = 'kde', title = 'Rating')

Count the frequency of each rating category:

PYTHON

df['Rating'].value_counts()

OUTPUT

7.1    52
6.7    48
7.0    46
6.3    44
6.6    42
7.2    42
7.3    42
6.5    40
7.8    40
6.2    37
6.8    37
7.5    35
6.4    35
7.4    33
6.9    31
6.1    31
7.6    27
7.7    27
5.8    26
6.0    26
8.1    26
7.9    23
5.7    21
8.0    19
5.9    19
5.6    17
5.5    14
5.3    12
5.4    12
5.2    11
8.2    10
4.9     7
8.3     7
4.7     6
8.5     6
4.6     5
5.1     5
5.0     4
4.8     4
4.3     4
8.4     4
3.9     3
8.6     3
8.8     2
2.7     2
4.2     2
3.5     2
3.7     2
9.0     1
3.2     1
4.0     1
4.5     1
4.4     1
4.1     1
1.9     1
Name: Rating, dtype: int64

To visualize the distribution, spread, and outliers of ratings, you can construct a box plot. Box plots represent the five-number summary of a dataset: the minimum, first quartile ( $Q_{1}$ ), median, third quartile ( $Q_{3}$ ), and maximum:

A standard box plot anatomy diagram showing minimum, lower quartile, median, upper quartile, maximum, and whiskers

Generate a box plot of movie ratings:

PYTHON

df['Rating'].plot(kind = 'box')

Compute numerical statistics for movie ratings to match the box plot coordinates:

PYTHON

df['Rating'].describe()

OUTPUT

count    1000.000000
mean        6.723200
std         0.945429
min         1.900000
25%         6.200000
50%         6.800000
75%         7.400000
max         9.000000
Name: Rating, dtype: float64

Add a new rating category column to label movies as good or bad depending on their rating threshold:

PYTHON

rating_cat = []
for rate in df['Rating']:
    if rate > 6.2:
        rating_cat.append('Good')
    else:
        rating_cat.append('Bad')

rating_cat[:20]

OUTPUT

['Good', 'Good', 'Good', 'Good', 'Bad', 'Bad', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good']

Assign the list to the DataFrame and display the head to confirm the column addition:

PYTHON

df['Rating Category'] = rating_cat
df.head()

OUTPUT

Rank	Title	Genre	Description	Director	Actors	Year	Runtime	Rating	Votes	Revenue	Metascore	Rating Category
1	Guardians of the Galaxy	Action,Adventure,Sci-Fi	A group of intergalactic criminals are forced ...	James Gunn	Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...	2014	121	8.1	757074	333.13	76.0	Good
2	Prometheus	Adventure,Mystery,Sci-Fi	Following clues to the origin of mankind, a te...	Ridley Scott	Noomi Rapace, Logan Marshall-Green, Michael Fa...	2012	124	7.0	485820	126.46	65.0	Good
3	Split	Horror,Thriller	Three girls are kidnapped by a man with a diag...	M. Night Shyamalan	James McAvoy, Anya Taylor-Joy, Haley Lu Richar...	2016	117	7.3	157606	138.12	62.0	Good
4	Sing	Animation,Comedy,Family	In a city of humanoid animals, a hustling thea...	Christophe Lourdelet	Matthew McConaughey,Reese Witherspoon, Seth Ma...	2016	108	7.2	60545	270.32	59.0	Good
5	Suicide Squad	Action,Adventure,Fantasy	A secret government agency recruits some of th...	David Ayer	Will Smith, Jared Leto, Margot Robbie, Viola D...	2016	123	6.2	393727	325.02	40.0	Bad

Group your data by a categorical feature and draw box plots for each group using the boxplot() method:

PYTHON

df.boxplot(column= 'Revenue', by = 'Rating Category')

Conclusion

In this tutorial, you explored the core functionalities of the Pandas library, including DataFrame creation, duplicate removal, column renaming, and handling missing values using mean imputation. Using the NBA and IMDB datasets, you walked through essential data cleaning steps and calculated pairwise correlation to assess linear dependencies between numeric variables.

Key takeaways:

DataFrames and Series: A DataFrame is a two-dimensional tabular structure, while a Series represents a single column within that DataFrame.
Wrangling and Deduplication: Tabular data often contains duplicates that can be filtered out using drop_duplicates(inplace=True) to clean the workspace.
Handling Null Values: Missing data (NaN) should either be dropped or imputed using operations like fillna() with statistical metrics such as the column mean.
Correlation: The corr() method helps calculate pairwise relationship coefficients to find dependencies before building predictive models.

Next steps:

Dive deeper into visual analysis with Data Visualization with Pandas to learn how to plot line, bar, scatter, and area graphs directly from DataFrames.
Master plotting basics with Matplotlib Crash Course to customize your plots with titles, axes, and styles.
Learn about the different kinds of variables you will encounter in Data Variable Types Every Data Scientist Needs to choose appropriate visualization tools.

Topics You Will Master

Datasets

Getting Started

Reading CSV Files

Column Operations

Missing Values

Imputation

Correlation Analysis

Conclusion

Latest recommendations you might like

LinkedIn Auto Connect Bot

Dimensionality Reduction with LDA and PCA in Python

Find this tutorial useful?

Discussion & Comments