Pandas Crash Course

Learn the fundamentals of pandas DataFrames, loading CSVs, column operations, handling missing values, mean imputation, and correlation analysis.

May 23, 2026 at 3:45 PM · 14 min read

Topics You Will Master

Creating and inspecting Pandas DataFrames and Series
Loading CSV files with custom row indices using NBA and IMDB datasets
Appending rows, dropping duplicates, and renaming columns in-place
Identifying and imputing missing data (NaN) with column mean values
Computing and analyzing correlation matrices of numerical variables

Best For

Python developers and data science beginners seeking a practical, hands-on introduction to data wrangling and exploratory data analysis.

Expected Outcome

A clean and structured workflow to load datasets, handle duplicates and missing values, and prepare structured tables for statistical analysis.

When working on a data science project, you spend most of your time cleaning and preprocessing data. Choosing the right library to manipulate tabular data efficiently is crucial for a smooth workflow.

Pandas is the standard open-source library for data analysis and manipulation in Python. Built on top of NumPy, it introduces the DataFrame, a two-dimensional tabular data structure with labeled axes (rows and columns) that makes data manipulation fast and intuitive.

In this tutorial, you will build a complete data cleaning and analysis pipeline. You will learn to construct DataFrames from scratch, import real-world datasets like NBA player stats and IMDB movie rankings, handle duplicate rows, impute missing values using statistical means, and analyze variable relationships with correlation.

Prerequisites: Python 3.x, Pandas, NumPy, Seaborn, Matplotlib.

Datasets

You can download all the datasets used in this tutorial from the All CSV ML Data Files Download repository.

Getting Started

To begin working with the library, you must first import the module under its standard alias:

PYTHON
import pandas as pd

Prepare a Python dictionary containing lists of values to represent orange and apple counts:

PYTHON
data = {
    'apple': [3,1,4,5],
    'orange': [1, 5, 6, 8]
}

data
OUTPUT
{'apple': [3, 1, 4, 5], 'orange': [1, 5, 6, 8]}

Check the data type of your newly created dictionary:

PYTHON
type(data)
OUTPUT
dict

Convert your Python dictionary into a Pandas DataFrame. A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure. It aligns data in a tabular fashion using rows and columns:

PYTHON
df = pd.DataFrame(data)
df
OUTPUT
|   | apple | orange |
|---|-------|--------|
| 0 | 3     | 1      |
| 1 | 1     | 5      |
| 2 | 4     | 6      |
| 3 | 5     | 8      |

Extract a single column to see how Pandas represents individual columns as Series. A Series is a one-dimensional array-like object that contains a sequence of values:

PYTHON
df['apple']
OUTPUT
0    3
1    1
2    4
3    5
Name: apple, dtype: int64

Verify that the type of the extracted column is indeed a Pandas Series:

PYTHON
type(df['apple'])
OUTPUT
pandas.core.series.Series
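
As a small illustration of how the two structures relate, the sketch below rebuilds the fruit table and checks that a column is a Series sharing the DataFrame's row index. The variable names `same_index` and `total` are illustrative and not part of the original walkthrough.

```python
import pandas as pd

# Same fruit counts as in the dictionary above
data = {'apple': [3, 1, 4, 5], 'orange': [1, 5, 6, 8]}
df = pd.DataFrame(data)

# Each column is a Series that shares the DataFrame's row index
apple = df['apple']
same_index = apple.index.equals(df.index)

# Arithmetic between two Series is aligned on that shared index
total = df['apple'] + df['orange']

print(same_index)      # True
print(total.tolist())  # [4, 6, 10, 13]
```

Because the columns share one index, element-wise operations line up row by row without any explicit loop.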

Reading CSV Files

To load external datasets, Pandas provides robust input/output functions. For example, read_csv() imports a comma-separated values file directly into a DataFrame:

PYTHON
df = pd.read_csv('nba.csv')

Display the first ten rows of the DataFrame using head() to examine the dataset structure:

PYTHON
df.head(10)
OUTPUT
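|   | Name | Team | Number | Position | Age | Height | Weight | College | Salary |
|---|------|------|--------|----------|-----|--------|--------|---------|--------|
| 0 | Avery Bradley | Boston Celtics | 0.0 | PG | 25.0 | 6-2 | 180.0 | Texas | 7730337.0 |
| 1 | Jae Crowder | Boston Celtics | 99.0 | SF | 25.0 | 6-6 | 235.0 | Marquette | 6796117.0 |
| 2 | John Holland | Boston Celtics | 30.0 | SG | 27.0 | 6-5 | 205.0 | Boston University | NaN |
| 3 | R.J. Hunter | Boston Celtics | 28.0 | SG | 22.0 | 6-5 | 185.0 | Georgia State | 1148640.0 |
| 4 | Jonas Jerebko | Boston Celtics | 8.0 | PF | 29.0 | 6-10 | 231.0 | NaN | 5000000.0 |
| 5 | Amir Johnson | Boston Celtics | 90.0 | PF | 29.0 | 6-9 | 240.0 | NaN | 12000000.0 |
| 6 | Jordan Mickey | Boston Celtics | 55.0 | PF | 21.0 | 6-8 | 235.0 | LSU | 1170960.0 |
| 7 | Kelly Olynyk | Boston Celtics | 41.0 | C | 25.0 | 7-0 | 238.0 | Gonzaga | 2165160.0 |
| 8 | Terry Rozier | Boston Celtics | 12.0 | PG | 22.0 | 6-2 | 190.0 | Louisville | 1824360.0 |
| 9 | Marcus Smart | Boston Celtics | 36.0 | PG | 22.0 | 6-4 | 220.0 | Oklahoma State | 3431040.0 |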

View the last two rows of the dataset using tail(). In this file the final row (index 457) contains only NaN values, which points to an empty trailing record:

PYTHON
df.tail(2)
OUTPUT
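|     | Name | Team | Number | Position | Age | Height | Weight | College | Salary |
|-----|------|------|--------|----------|-----|--------|--------|---------|--------|
| 456 | Jeff Withey | Utah Jazz | 24.0 | C | 26.0 | 7-0 | 231.0 | Kansas | 947276.0 |
| 457 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |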

Load the CSV file while specifying the Name column to act as the row labels or index of the DataFrame:

PYTHON
df = pd.read_csv('nba.csv', index_col = 'Name')
df.head()
OUTPUT
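| Name | Team | Number | Position | Age | Height | Weight | College | Salary |
|------|------|--------|----------|-----|--------|--------|---------|--------|
| Avery Bradley | Boston Celtics | 0.0 | PG | 25.0 | 6-2 | 180.0 | Texas | 7730337.0 |
| Jae Crowder | Boston Celtics | 99.0 | SF | 25.0 | 6-6 | 235.0 | Marquette | 6796117.0 |
| John Holland | Boston Celtics | 30.0 | SG | 27.0 | 6-5 | 205.0 | Boston University | NaN |
| R.J. Hunter | Boston Celtics | 28.0 | SG | 22.0 | 6-5 | 185.0 | Georgia State | 1148640.0 |
| Jonas Jerebko | Boston Celtics | 8.0 | PF | 29.0 | 6-10 | 231.0 | NaN | 5000000.0 |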
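
One reason to promote a column to the index is label-based lookup. The following is a minimal sketch that uses an in-memory CSV (via `io.StringIO`) as a stand-in for nba.csv, with a few values copied from the rows shown above; the names `csv_text` and `players` are illustrative.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for nba.csv (values copied from the rows shown above)
csv_text = """Name,Team,Position,Age
Avery Bradley,Boston Celtics,PG,25
Jae Crowder,Boston Celtics,SF,25
R.J. Hunter,Boston Celtics,SG,22
"""

players = pd.read_csv(io.StringIO(csv_text), index_col='Name')

# With Name as the index, rows can be looked up by player name using .loc
crowder_position = players.loc['Jae Crowder', 'Position']

print(players.index.name)  # Name
print(crowder_position)    # SF
```

Note that the Name column no longer appears among the regular columns once it becomes the index.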

You can also use custom row labels for other datasets. For example, load the IMDB Movie dataset using the movie's rank as the index:

PYTHON
df = pd.read_csv('IMDB-Movie-Data.csv', index_col = 'Rank')
df.head()
OUTPUT
| Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|------|-------|-------|-------------|----------|--------|------|-------------------|--------|-------|--------------------|-----------|
| 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |

Display the last five rows of the IMDB dataset to verify the tail structure:

PYTHON
df.tail()
OUTPUT
| Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|------|-------|-------|-------------|----------|--------|------|-------------------|--------|-------|--------------------|-----------|
| 996 | Secret in Their Eyes | Crime,Drama,Mystery | A tight-knit team of rising investigators, alo... | Billy Ray | Chiwetel Ejiofor, Nicole Kidman, Julia Roberts... | 2015 | 111 | 6.2 | 27585 | NaN | 45.0 |
| 997 | Hostel: Part II | Horror | Three American college students studying abroa... | Eli Roth | Lauren German, Heather Matarazzo, Bijou Philli... | 2007 | 94 | 5.5 | 73152 | 17.54 | 46.0 |
| 998 | Step Up 2: The Streets | Drama,Music,Romance | Romantic sparks occur between two dance studen... | Jon M. Chu | Robert Hoffman, Briana Evigan, Cassie Ventura,... | 2008 | 98 | 6.2 | 70699 | 58.01 | 50.0 |
| 999 | Search Party | Adventure,Comedy | A pair of friends embark on a mission to reuni... | Scot Armstrong | Adam Pally, T.J. Miller, Thomas Middleditch,Sh... | 2014 | 93 | 5.6 | 4881 | NaN | 22.0 |
| 1000 | Nine Lives | Comedy,Family,Fantasy | A stuffy businessman finds himself trapped ins... | Barry Sonnenfeld | Kevin Spacey, Jennifer Garner, Robbie Amell,Ch... | 2016 | 87 | 5.3 | 12435 | 19.64 | 11.0 |

To inspect the schema, data types, and non-null counts of each column, call the info() method:

PYTHON
df.info()
OUTPUT
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 1 to 1000
Data columns (total 11 columns):
Title                 1000 non-null object
Genre                 1000 non-null object
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null int64
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: float64(3), int64(3), object(5)
memory usage: 93.8+ KB

To check the dimensionality of the DataFrame, inspect the shape attribute:

PYTHON
df.shape
OUTPUT
(1000, 11)

You can count duplicate rows by combining the duplicated() method with a sum operation:

PYTHON
sum(df.duplicated())
OUTPUT
0

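To demonstrate duplicate handling, concatenate the DataFrame with itself to double the number of rows. Use pd.concat() for this, because DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0:

PYTHON
df1 = pd.concat([df, df])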
df1.shape
OUTPUT
(2000, 11)

Verify the number of duplicate rows in the newly concatenated DataFrame:

PYTHON
df1.duplicated().sum()
OUTPUT
1000

Remove the duplicate rows using drop_duplicates() to return a cleaned copy:

PYTHON
df2 = df1.drop_duplicates()
df2.shape
OUTPUT
(1000, 11)

Note that the original DataFrame remains unchanged unless you assign the result back or modify it in place:

PYTHON
df1.shape
OUTPUT
(2000, 11)

To modify the DataFrame directly without creating a new copy, set the inplace parameter to True:

PYTHON
df1.drop_duplicates(inplace = True)
df1.shape
OUTPUT
(1000, 11)
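
Both duplicated() and drop_duplicates() accept `subset` and `keep` arguments, which help when rows are only partially identical. The sketch below uses a hypothetical four-row frame (`movies`), not the IMDB data, to show how these options change the result.

```python
import pandas as pd

# Hypothetical toy frame; the second and third rows are exact duplicates
movies = pd.DataFrame({
    'Title': ['Split', 'Sing', 'Sing', 'Sing'],
    'Year': [2016, 2016, 2016, 2021],
})

# duplicated() marks every repeat after the first occurrence
n_exact = int(movies.duplicated().sum())

# subset= restricts the comparison to the chosen columns
n_same_title = int(movies.duplicated(subset=['Title']).sum())

# keep='last' retains the final occurrence instead of the first
deduped = movies.drop_duplicates(subset=['Title'], keep='last')

print(n_exact, n_same_title)     # 1 2
print(deduped['Year'].tolist())  # [2016, 2021]
```

Comparing on a subset is stricter about what counts as a repeat, so check that the chosen columns really identify a record before dropping rows.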

Column Operations

Access the column labels of the DataFrame using the columns attribute:

PYTHON
df.columns
OUTPUT
Index(['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

Count the number of columns in the DataFrame:

PYTHON
len(df.columns)
OUTPUT
11

Generate descriptive statistics for the numerical columns using the describe() method:

PYTHON
df.describe()
OUTPUT
|       | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|-------|------|-------------------|--------|-------|--------------------|-----------|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1.000000e+03 | 872.000000 | 936.000000 |
| mean | 2012.783000 | 113.172000 | 6.723200 | 1.698083e+05 | 82.956376 | 58.985043 |
| std | 3.205962 | 18.810908 | 0.945429 | 1.887626e+05 | 103.253540 | 17.194757 |
| min | 2006.000000 | 66.000000 | 1.900000 | 6.100000e+01 | 0.000000 | 11.000000 |
| 25% | 2010.000000 | 100.000000 | 6.200000 | 3.630900e+04 | 13.270000 | 47.000000 |
| 50% | 2014.000000 | 111.000000 | 6.800000 | 1.107990e+05 | 47.985000 | 59.500000 |
| 75% | 2016.000000 | 123.000000 | 7.400000 | 2.399098e+05 | 113.715000 | 72.000000 |
| max | 2016.000000 | 191.000000 | 9.000000 | 1.791916e+06 | 936.630000 | 100.000000 |

Extract the column names and convert them into a standard Python list:

PYTHON
col = df.columns
type(list(col))
OUTPUT
list

Display the index object containing all column names:

PYTHON
col
OUTPUT
Index(['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

You can rename columns by assigning a list of new names directly to the columns attribute:

PYTHON
col1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
df.columns = col1
df.head()
OUTPUT
| Rank | a | b | c | d | e | f | g | h | i | j | k |
|------|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 |
| 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 |
| 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 |
| 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 |
| 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 |

Revert back to the original column names:

PYTHON
df.columns = col
df.head(0)
OUTPUT
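| Rank | Title | Genre | Description | Director | Actors | Year | Runtime (Minutes) | Rating | Votes | Revenue (Millions) | Metascore |
|------|-------|-------|-------------|----------|--------|------|-------------------|--------|-------|--------------------|-----------|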

For more precise renaming, use the rename() method with a dictionary mapping old column names to new ones:

PYTHON
df.rename(columns={
    'Runtime (Minutes)': 'Runtime',
    'Revenue (Millions)': 'Revenue'
}, inplace= True)

df.columns
OUTPUT
Index(['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime', 'Rating', 'Votes', 'Revenue', 'Metascore'],
      dtype='object')

Compare the updated column list with the original stored in the col variable:

PYTHON
col
OUTPUT
Index(['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')
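
When you would rather keep the original DataFrame intact, rename() can be called without inplace=True, and it also accepts a function instead of a dictionary. A minimal sketch on a two-column toy frame (`df_demo` is an illustrative name):

```python
import pandas as pd

# Toy frame using two of the IMDB column names from above
df_demo = pd.DataFrame({
    'Runtime (Minutes)': [121, 124],
    'Revenue (Millions)': [333.13, 126.46],
})

# Without inplace=True, rename() returns a new DataFrame and leaves the original untouched
renamed = df_demo.rename(columns={'Runtime (Minutes)': 'Runtime'})

# A callable is applied to every column label
lowered = renamed.rename(columns=str.lower)

print(list(df_demo.columns))  # ['Runtime (Minutes)', 'Revenue (Millions)']
print(list(lowered.columns))  # ['runtime', 'revenue (millions)']
```

Returning a new object makes it easy to chain several cleaning steps without mutating the source data.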

Missing Values

Real-world datasets often contain missing values. You can represent these missing entries in Python using NumPy's nan constant:

PYTHON
import numpy as np

Access the standard representation for empty values:

PYTHON
np.nan
OUTPUT
nan

Identify missing values in the DataFrame using isnull(), which returns boolean values indicating the presence of null values:

PYTHON
df.isnull().head(10)
OUTPUT
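| Rank | Title | Genre | Description | Director | Actors | Year | Runtime | Rating | Votes | Revenue | Metascore |
|------|-------|-------|-------------|----------|--------|------|---------|--------|-------|---------|-----------|
| 1 | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | False |
| 5 | False | False | False | False | False | False | False | False | False | False | False |
| 6 | False | False | False | False | False | False | False | False | False | False | False |
| 7 | False | False | False | False | False | False | False | False | False | False | False |
| 8 | False | False | False | False | False | False | False | False | False | True | False |
| 9 | False | False | False | False | False | False | False | False | False | False | False |
| 10 | False | False | False | False | False | False | False | False | False | False | False |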

Sum the boolean values to count the number of missing records in each column:

PYTHON
df.isnull().sum()
OUTPUT
Title            0
Genre            0
Description      0
Director         0
Actors           0
Year             0
Runtime          0
Rating           0
Votes            0
Revenue        128
Metascore       64
dtype: int64

Alternatively, call isna() to achieve the same result:

PYTHON
df.isna().sum()
OUTPUT
Title            0
Genre            0
Description      0
Director         0
Actors           0
Year             0
Runtime          0
Rating           0
Votes            0
Revenue        128
Metascore       64
dtype: int64
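
Raw counts are easier to judge as a share of all rows. One way to get that, sketched here on hypothetical toy data (`scores`), is to take the mean of the boolean mask instead of its sum:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one of four revenues and two of four scores are missing
scores = pd.DataFrame({
    'Revenue': [333.13, np.nan, 138.12, 270.32],
    'Metascore': [76.0, np.nan, np.nan, 59.0],
})

# The mean of a boolean column is the fraction of True values
missing_share = scores.isna().mean()

print(missing_share['Revenue'])    # 0.25
print(missing_share['Metascore'])  # 0.5
```

Applied to the IMDB counts above, 128 missing revenues out of 1000 rows corresponds to a share of 0.128.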

Remove any rows containing at least one missing value using dropna():

PYTHON
df1 = df.dropna()
df1.shape
OUTPUT
(838, 11)

To drop columns that contain missing values instead of rows, set axis=1:

PYTHON
df2 = df.dropna(axis = 1)
df2.head(3)
OUTPUT
| Rank | Title | Genre | Description | Director | Actors | Year | Runtime | Rating | Votes |
|------|-------|-------|-------------|----------|--------|------|---------|--------|-------|
| 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 |
| 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 |
| 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 |
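
Dropping every row or column with any gap can discard a lot of data. dropna() also takes `subset` and `thresh` arguments for finer control; the sketch below shows both on a hypothetical three-row frame (`toy`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two columns
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Revenue': [10.0, np.nan, np.nan],
    'Metascore': [50.0, 60.0, np.nan],
})

# Drop rows only when Revenue is missing, ignoring gaps elsewhere
by_subset = toy.dropna(subset=['Revenue'])

# thresh=2 keeps rows that have at least two non-missing values
by_thresh = toy.dropna(thresh=2)

print(by_subset['Title'].tolist())  # ['A']
print(by_thresh['Title'].tolist())  # ['A', 'B']
```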

Imputation

Rather than deleting rows or columns, you can fill missing cells with a specific value. Fill all missing entries with 0 using fillna():

PYTHON
df3 = df.fillna(0)
df3.isna().sum()
OUTPUT
Title          0
Genre          0
Description    0
Director       0
Actors         0
Year           0
Runtime        0
Rating         0
Votes          0
Revenue        0
Metascore      0
dtype: int64

Recall the counts of null values in the original DataFrame:

PYTHON
df.isnull().sum()
OUTPUT
Title            0
Genre            0
Description      0
Director         0
Actors           0
Year             0
Runtime          0
Rating           0
Votes            0
Revenue        128
Metascore       64
dtype: int64

Extract the Revenue column as a Pandas Series:

PYTHON
revenue = df['Revenue']
type(revenue)
OUTPUT
pandas.core.series.Series

Display the last five elements of the revenue Series to check for missing entries:

PYTHON
revenue.tail()
OUTPUT
Rank
996       NaN
997     17.54
998     58.01
999       NaN
1000    19.64
Name: Revenue, dtype: float64

Compute the average revenue to use for imputation:

PYTHON
revenue_mean = revenue.mean()
revenue_mean
OUTPUT
82.95637614678897

Fill the missing entries in revenue with the calculated mean and keep the result by assigning it back to the variable:

PYTHON
revenue = revenue.fillna(revenue_mean)
revenue.tail()
OUTPUT
Rank
996     82.956376
997     17.540000
998     58.010000
999     82.956376
1000    19.640000
Name: Revenue, dtype: float64

Confirm that the revenue Series no longer contains null values:

PYTHON
revenue.isnull().sum()
OUTPUT
0

Assign the imputed Series back to the DataFrame and check the remaining null counts:

PYTHON
df['Revenue'] = revenue
df.isnull().sum()
OUTPUT
Title           0
Genre           0
Description     0
Director        0
Actors          0
Year            0
Runtime         0
Rating          0
Votes           0
Revenue         0
Metascore      64
dtype: int64

Apply the same mean-imputation process to the Metascore column:

PYTHON
metascore = df['Metascore']
metascore_mean = metascore.mean()
print("metascore_mean =", metascore_mean)
# Replace each NaN with the column mean, then write the Series back to the DataFrame
metascore = metascore.fillna(metascore_mean)
df['Metascore'] = metascore
df.isnull().sum()
OUTPUT
metascore_mean = 58.98504273504273

Title          0
Genre          0
Description    0
Director       0
Actors         0
Year           0
Runtime        0
Rating         0
Votes          0
Revenue        0
Metascore      0
dtype: int64
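
The two columns were imputed one at a time above. As an alternative sketch, fillna() also accepts a Series of per-column values, so every numeric column can be filled with its own mean in one call. The frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one gap in each numeric column
toy = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Revenue': [10.0, np.nan, 30.0],
    'Metascore': [40.0, 60.0, np.nan],
})

# mean(numeric_only=True) returns a Series of per-column means for the numeric columns
col_means = toy.mean(numeric_only=True)

# Passing a Series fills each column with the value stored under the matching label
imputed = toy.fillna(col_means)

print(imputed['Revenue'].tolist())    # [10.0, 20.0, 30.0]
print(imputed['Metascore'].tolist())  # [40.0, 60.0, 50.0]
```

Columns without an entry in `col_means`, such as Title here, are left untouched.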

Verify the statistical summary of the numerical columns now that the missing values are imputed:

PYTHON
df.describe()
OUTPUT
|       | Year | Runtime | Rating | Votes | Revenue | Metascore |
|-------|------|---------|--------|-------|---------|-----------|
| count | 1000.000000 | 1000.000000 | 1000.000000 | 1.000000e+03 | 1000.000000 | 1000.000000 |
| mean | 2012.783000 | 113.172000 | 6.723200 | 1.698083e+05 | 82.956376 | 58.985043 |
| std | 3.205962 | 18.810908 | 0.945429 | 1.887626e+05 | 96.412043 | 16.634858 |
| min | 2006.000000 | 66.000000 | 1.900000 | 6.100000e+01 | 0.000000 | 11.000000 |
| 25% | 2010.000000 | 100.000000 | 6.200000 | 3.630900e+04 | 17.442500 | 47.750000 |
| 50% | 2014.000000 | 111.000000 | 6.800000 | 1.107990e+05 | 60.375000 | 58.985043 |
| 75% | 2016.000000 | 123.000000 | 7.400000 | 2.399098e+05 | 99.177500 | 71.000000 |
| max | 2016.000000 | 191.000000 | 9.000000 | 1.791916e+06 | 936.630000 | 100.000000 |

Print the updated info schema to confirm all columns have 1000 non-null values:

PYTHON
df.info()
OUTPUT
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 1 to 1000
Data columns (total 11 columns):
Title          1000 non-null object
Genre          1000 non-null object
Description    1000 non-null object
Director       1000 non-null object
Actors         1000 non-null object
Year           1000 non-null int64
Runtime        1000 non-null int64
Rating         1000 non-null float64
Votes          1000 non-null int64
Revenue        1000 non-null float64
Metascore      1000 non-null float64
dtypes: float64(3), int64(3), object(5)
memory usage: 93.8+ KB

Get a statistical summary of the non-numerical Genre column:

PYTHON
df['Genre'].describe()
OUTPUT
count                        1000
unique                        207
top       Action,Adventure,Sci-Fi
freq                           50
Name: Genre, dtype: object

Count the frequencies of the top ten movie genres:

PYTHON
df['Genre'].value_counts().head(10)
OUTPUT
Action,Adventure,Sci-Fi       50
Drama                         48
Comedy,Drama,Romance          35
Comedy                        32
Drama,Romance                 31
Comedy,Drama                  27
Action,Adventure,Fantasy      27
Animation,Adventure,Comedy    27
Comedy,Romance                26
Crime,Drama,Thriller          24
Name: Genre, dtype: int64

Check the count of unique movie genres present in the dataset:

PYTHON
len(df['Genre'].unique())
OUTPUT
207

Correlation Analysis

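Compute the pairwise Pearson correlation matrix for the numerical columns using the corr() method. In pandas 2.0 and later, pass numeric_only=True so that text columns such as Title and Genre are skipped instead of raising an error:

PYTHON
corrmat = df.corr(numeric_only=True)
corrmat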
OUTPUT
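|           | Year | Runtime | Rating | Votes | Revenue | Metascore |
|-----------|------|---------|--------|-------|---------|-----------|
| Year | 1.000000 | -0.164900 | -0.211219 | -0.411904 | -0.117562 | -0.076077 |
| Runtime | -0.164900 | 1.000000 | 0.392214 | 0.407062 | 0.247834 | 0.202239 |
| Rating | -0.211219 | 0.392214 | 1.000000 | 0.511537 | 0.189527 | 0.604723 |
| Votes | -0.411904 | 0.407062 | 0.511537 | 1.000000 | 0.607941 | 0.318116 |
| Revenue | -0.117562 | 0.247834 | 0.189527 | 0.607941 | 1.000000 | 0.132304 |
| Metascore | -0.076077 | 0.202239 | 0.604723 | 0.318116 | 0.132304 | 1.000000 |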
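
To make the coefficient concrete, here is a minimal sketch that computes Pearson's r by hand for two short Series and compares it with the value pandas returns. The numbers are hypothetical toy values, not taken from the IMDB data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not taken from the IMDB data
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Series.corr() uses the Pearson method by default
r_pandas = x.corr(y)

print(round(r_manual, 4), round(r_pandas, 4))  # 0.7746 0.7746
```

The coefficient ranges from -1 to 1, and each cell in the matrix above is this value for one pair of columns, which is why the diagonal is always 1.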

A table of coefficients is hard to scan. To visualize the correlation matrix as colors instead, import the Seaborn library:

PYTHON
import seaborn as sns

Plot the correlation matrix as a color-encoded heat map:

PYTHON
sns.heatmap(corrmat)

You can also use Matplotlib to generate visualizations directly from DataFrames. Import the Matplotlib pyplot interface:

PYTHON
import matplotlib.pyplot as plt

Generate a scatter plot of movie ratings versus revenues:

PYTHON
df.plot(kind = 'scatter', x = 'Rating', y = 'Revenue', title = 'Revenue vs Rating')

Create a histogram plot to view the frequency distribution of ratings:

PYTHON
df['Rating'].plot(kind = 'hist', title = 'Rating')

Estimate the probability density of ratings using a Kernel Density Estimation (KDE) plot:

PYTHON
df['Rating'].plot(kind = 'kde', title = 'Rating')

Count how many movies received each distinct rating value:

PYTHON
df['Rating'].value_counts()
OUTPUT
7.1    52
6.7    48
7.0    46
6.3    44
6.6    42
7.2    42
7.3    42
6.5    40
7.8    40
6.2    37
6.8    37
7.5    35
6.4    35
7.4    33
6.9    31
6.1    31
7.6    27
7.7    27
5.8    26
6.0    26
8.1    26
7.9    23
5.7    21
8.0    19
5.9    19
5.6    17
5.5    14
5.3    12
5.4    12
5.2    11
8.2    10
4.9     7
8.3     7
4.7     6
8.5     6
4.6     5
5.1     5
5.0     4
4.8     4
4.3     4
8.4     4
3.9     3
8.6     3
8.8     2
2.7     2
4.2     2
3.5     2
3.7     2
9.0     1
3.2     1
4.0     1
4.5     1
4.4     1
4.1     1
1.9     1
Name: Rating, dtype: int64

To visualize the distribution, spread, and outliers of ratings, you can construct a box plot. Box plots represent the five-number summary of a dataset: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum:

[Figure: box plot anatomy showing the minimum, lower quartile, median, upper quartile, maximum, and whiskers]

Generate a box plot of movie ratings:

PYTHON
df['Rating'].plot(kind = 'box')

Compute numerical statistics for movie ratings to match the box plot coordinates:

PYTHON
df['Rating'].describe()
OUTPUT
count    1000.000000
mean        6.723200
std         0.945429
min         1.900000
25%         6.200000
50%         6.800000
75%         7.400000
max         9.000000
Name: Rating, dtype: float64
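
With Matplotlib's default settings, the whiskers do not always reach the minimum and maximum: they stop at the last data point within 1.5 times the interquartile range (IQR) of the box, and anything beyond is drawn as an outlier point. The sketch below applies that rule to a hypothetical sample (`ratings`) chosen so that its quartiles match the 6.2 and 7.4 reported above.

```python
import pandas as pd

# Hypothetical sample whose quartiles match the describe() output above
ratings = pd.Series([1.9, 5.8, 6.2, 6.5, 6.8, 7.0, 7.4, 7.8, 9.0])

q1 = ratings.quantile(0.25)
q3 = ratings.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (Matplotlib's default whisker rule)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences would be plotted as individual outlier points
outliers = ratings[(ratings < lower_fence) | (ratings > upper_fence)]

print(round(lower_fence, 2), round(upper_fence, 2))  # 4.4 9.2
print(outliers.tolist())                             # [1.9]
```

With these quartiles, the lowest rating of 1.9 falls below the lower fence, while the maximum of 9.0 stays inside the upper one.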

Build a rating category that labels a movie 'Good' when its rating is above 6.2 and 'Bad' otherwise:

PYTHON
rating_cat = []
for rate in df['Rating']:
    if rate > 6.2:
        rating_cat.append('Good')
    else:
        rating_cat.append('Bad')

rating_cat[:20]
OUTPUT
['Good', 'Good', 'Good', 'Good', 'Bad', 'Bad', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good', 'Good']

Assign the list to the DataFrame and display the head to confirm the column addition:

PYTHON
df['Rating Category'] = rating_cat
df.head()
OUTPUT
| Rank | Title | Genre | Description | Director | Actors | Year | Runtime | Rating | Votes | Revenue | Metascore | Rating Category |
|------|-------|-------|-------------|----------|--------|------|---------|--------|-------|---------|-----------|-----------------|
| 1 | Guardians of the Galaxy | Action,Adventure,Sci-Fi | A group of intergalactic criminals are forced ... | James Gunn | Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S... | 2014 | 121 | 8.1 | 757074 | 333.13 | 76.0 | Good |
| 2 | Prometheus | Adventure,Mystery,Sci-Fi | Following clues to the origin of mankind, a te... | Ridley Scott | Noomi Rapace, Logan Marshall-Green, Michael Fa... | 2012 | 124 | 7.0 | 485820 | 126.46 | 65.0 | Good |
| 3 | Split | Horror,Thriller | Three girls are kidnapped by a man with a diag... | M. Night Shyamalan | James McAvoy, Anya Taylor-Joy, Haley Lu Richar... | 2016 | 117 | 7.3 | 157606 | 138.12 | 62.0 | Good |
| 4 | Sing | Animation,Comedy,Family | In a city of humanoid animals, a hustling thea... | Christophe Lourdelet | Matthew McConaughey,Reese Witherspoon, Seth Ma... | 2016 | 108 | 7.2 | 60545 | 270.32 | 59.0 | Good |
| 5 | Suicide Squad | Action,Adventure,Fantasy | A secret government agency recruits some of th... | David Ayer | Will Smith, Jared Leto, Margot Robbie, Viola D... | 2016 | 123 | 6.2 | 393727 | 325.02 | 40.0 | Bad |
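
The explicit loop works, but the same labeling can be done in one vectorized step. The sketch below uses NumPy's where() on a toy frame (`ratings`) holding four rating values copied from the rows shown earlier; it is an alternative, not the method used in the walkthrough.

```python
import numpy as np
import pandas as pd

# Toy sample of rating values copied from rows shown earlier
ratings = pd.DataFrame({'Rating': [8.1, 7.0, 6.2, 5.3]})

# Vectorized equivalent of the loop: 'Good' where Rating > 6.2, otherwise 'Bad'
ratings['Rating Category'] = np.where(ratings['Rating'] > 6.2, 'Good', 'Bad')

print(ratings['Rating Category'].tolist())  # ['Good', 'Good', 'Bad', 'Bad']
```

A rating of exactly 6.2 is labeled 'Bad' because the comparison is strict, which matches the fifth row of the table above.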

Group your data by a categorical feature and draw box plots for each group using the boxplot() method:

PYTHON
df.boxplot(column= 'Revenue', by = 'Rating Category')
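
To read the numbers behind a grouped box plot, the same grouping can be expressed with groupby(). The sketch below uses four revenue values copied from rows shown earlier, paired with the category their rating would receive; `toy` and `medians` are illustrative names.

```python
import pandas as pd

# Four revenues copied from rows shown earlier, with the category their rating implies
toy = pd.DataFrame({
    'Revenue': [333.13, 126.46, 325.02, 19.64],
    'Rating Category': ['Good', 'Good', 'Bad', 'Bad'],
})

# Median revenue per category: the line drawn inside each box
medians = toy.groupby('Rating Category')['Revenue'].median()

# Number of movies behind each box
counts = toy.groupby('Rating Category')['Revenue'].count()

print(medians)
print(counts.to_dict())  # {'Bad': 2, 'Good': 2}
```

Calling describe() in place of median() on the grouped column returns the full set of quartiles for each group.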

Conclusion

In this tutorial, you explored the core functionalities of the Pandas library, including DataFrame creation, duplicate removal, column renaming, and handling missing values using mean imputation. Using the NBA and IMDB datasets, you walked through essential data cleaning steps and calculated pairwise correlation to assess linear dependencies between numeric variables.

Key takeaways:

  • DataFrames and Series: A DataFrame is a two-dimensional tabular structure, while a Series represents a single column within that DataFrame.
  • Wrangling and Deduplication: Tabular data often contains duplicate rows. drop_duplicates() removes them, either by returning a cleaned copy or, with inplace=True, by modifying the DataFrame directly.
  • Handling Null Values: Missing data (NaN) should either be dropped or imputed using operations like fillna() with statistical metrics such as the column mean.
  • Correlation: The corr() method computes pairwise Pearson correlation coefficients, which measure linear relationships between numeric columns and help you spot dependencies before building predictive models.
