Pandas Crash Course

Published by georgiannacambel on 17 September 202017 September 2020

What is Pandas?

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, in Python programming language. It is a high-level data manipulation tool developed by Wes McKinney. It is built on the Numpy package and its key data structure is called the DataFrame. DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables. Pandas is mainly used for data analysis. Pandas allows importing data from various file formats such as comma-separated values, JSON, SQL, Microsoft Excel. Pandas allows various data manipulation operations such as merging, reshaping, selecting, as well as data cleaning, and data wrangling features.

DataFrame object for data manipulation with integrated indexing.
Tools for reading and writing data between in-memory data structures and different file formats.
Data alignment and integrated handling of missing data.
Reshaping and pivoting of data sets.
Label-based slicing, fancy indexing, and subsetting of large data sets.
Data structure column insertion and deletion.
Group by engine allowing split-apply-combine operations on data sets.
Data set merging and joining.
Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
Time series-functionality: Date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging.
Provides data filtration.

Dataset

You can download all the datasets used in this notebook from here.

Let’s start!

We will first start by importing pandas.

import pandas as pd

We will prepare a python dictionary named data.

data = {
    'apple': [3,1,4,5],
    'orange': [1, 5, 6, 8]
}

data

{'apple': [3, 1, 4, 5], 'orange': [1, 5, 6, 8]}

type() method returns class type of the argument(object) passed as parameter. dict means that data is a dictionary.

type(data)

dict

Now we will convert data into a DataFrame. Pandas DataFrame is two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns. Pandas DataFrame consists of three principal components, the data, rows, and columns.

df = pd.DataFrame(data)
df

	apple	orange
0	3	1
1	1	5
2	4	6
3	5	8

Each column of df is a Series. df['apple'] returns only the column with header 'apple'. If we check the type of this column we can see that its a Series.

df['apple']

0    3
1    1
2    4
3    5
Name: apple, dtype: int64

type(df['apple'])

pandas.core.series.Series

Reading CSV files

Now we will see how to read CSV(Comma Separated Values) files into a pandas dataframe. read_csv() reads a comma-separated values file into DataFrame.

df = pd.read_csv('nba.csv')

head(n) function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it. Below we can see the first 10 rows of df.

df.head(10)

	Name	Team	Number	Position	Age	Height	Weight	College	Salary
0	Avery Bradley	Boston Celtics	0.0	PG	25.0	6-2	180.0	Texas	7730337.0
1	Jae Crowder	Boston Celtics	99.0	SF	25.0	6-6	235.0	Marquette	6796117.0
2	John Holland	Boston Celtics	30.0	SG	27.0	6-5	205.0	Boston University	NaN
3	R.J. Hunter	Boston Celtics	28.0	SG	22.0	6-5	185.0	Georgia State	1148640.0
4	Jonas Jerebko	Boston Celtics	8.0	PF	29.0	6-10	231.0	NaN	5000000.0
5	Amir Johnson	Boston Celtics	90.0	PF	29.0	6-9	240.0	NaN	12000000.0
6	Jordan Mickey	Boston Celtics	55.0	PF	21.0	6-8	235.0	LSU	1170960.0
7	Kelly Olynyk	Boston Celtics	41.0	C	25.0	7-0	238.0	Gonzaga	2165160.0
8	Terry Rozier	Boston Celtics	12.0	PG	22.0	6-2	190.0	Louisville	1824360.0
9	Marcus Smart	Boston Celtics	36.0	PG	22.0	6-4	220.0	Oklahoma State	3431040.0

tail(n) returns last n rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows.

df.tail(2)

	Name	Team	Number	Position	Age	Height	Weight	College	Salary
456	Jeff Withey	Utah Jazz	24.0	C	26.0	7-0	231.0	Kansas	947276.0
457	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Using index_col we can specify the column(s) to use as the row labels of the DataFrame. We can either given as string name or column index. By default the row labels are numbers from 0 to n-1 where n is the total number of rows in the data.

df = pd.read_csv('nba.csv', index_col = 'Name')
df.head()

	Team	Number	Position	Age	Height	Weight	College	Salary
Name
Avery Bradley	Boston Celtics	0.0	PG	25.0	6-2	180.0	Texas	7730337.0
Jae Crowder	Boston Celtics	99.0	SF	25.0	6-6	235.0	Marquette	6796117.0
John Holland	Boston Celtics	30.0	SG	27.0	6-5	205.0	Boston University	NaN
R.J. Hunter	Boston Celtics	28.0	SG	22.0	6-5	185.0	Georgia State	1148640.0
Jonas Jerebko	Boston Celtics	8.0	PF	29.0	6-10	231.0	NaN	5000000.0

Now we will load the IMDB-Movie-Data into df using ‘Rank’ as the row label.

df = pd.read_csv('IMDB-Movie-Data.csv', index_col = 'Rank')
df.head()

	Title	Genre	Description	Director	Actors	Year	Runtime (Minutes)	Rating	Votes	Revenue (Millions)	Metascore
Rank
1	Guardians of the Galaxy	Action,Adventure,Sci-Fi	A group of intergalactic criminals are forced …	James Gunn	Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S…	2014	121	8.1	757074	333.13	76.0
2	Prometheus	Adventure,Mystery,Sci-Fi	Following clues to the origin of mankind, a te…	Ridley Scott	Noomi Rapace, Logan Marshall-Green, Michael Fa…	2012	124	7.0	485820	126.46	65.0
3	Split	Horror,Thriller	Three girls are kidnapped by a man with a diag…	M. Night Shyamalan	James McAvoy, Anya Taylor-Joy, Haley Lu Richar…	2016	117	7.3	157606	138.12	62.0
4	Sing	Animation,Comedy,Family	In a city of humanoid animals, a hustling thea…	Christophe Lourdelet	Matthew McConaughey,Reese Witherspoon, Seth Ma…	2016	108	7.2	60545	270.32	59.0
5	Suicide Squad	Action,Adventure,Fantasy	A secret government agency recruits some of th…	David Ayer	Will Smith, Jared Leto, Margot Robbie, Viola D…	2016	123	6.2	393727	325.02	40.0

df.tail()

	Title	Genre	Description	Director	Actors	Year	Runtime (Minutes)	Rating	Votes	Revenue (Millions)	Metascore
Rank
996	Secret in Their Eyes	Crime,Drama,Mystery	A tight-knit team of rising investigators, alo…	Billy Ray	Chiwetel Ejiofor, Nicole Kidman, Julia Roberts…	2015	111	6.2	27585	NaN	45.0
997	Hostel: Part II	Horror	Three American college students studying abroa…	Eli Roth	Lauren German, Heather Matarazzo, Bijou Philli…	2007	94	5.5	73152	17.54	46.0
998	Step Up 2: The Streets	Drama,Music,Romance	Romantic sparks occur between two dance studen…	Jon M. Chu	Robert Hoffman, Briana Evigan, Cassie Ventura,…	2008	98	6.2	70699	58.01	50.0
999	Search Party	Adventure,Comedy	A pair of friends embark on a mission to reuni…	Scot Armstrong	Adam Pally, T.J. Miller, Thomas Middleditch,Sh…	2014	93	5.6	4881	NaN	22.0
1000	Nine Lives	Comedy,Family,Fantasy	A stuffy businessman finds himself trapped ins…	Barry Sonnenfeld	Kevin Spacey, Jennifer Garner, Robbie Amell,Ch…	2016	87	5.3	12435	19.64	11.0

Now we will print a concise summary of a DataFrame. info() prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 1 to 1000
Data columns (total 11 columns):
Title                 1000 non-null object
Genre                 1000 non-null object
Description           1000 non-null object
Director              1000 non-null object
Actors                1000 non-null object
Year                  1000 non-null int64
Runtime (Minutes)     1000 non-null int64
Rating                1000 non-null float64
Votes                 1000 non-null int64
Revenue (Millions)    872 non-null float64
Metascore             936 non-null float64
dtypes: float64(3), int64(3), object(5)
memory usage: 93.8+ KB

shape returns a tuple representing the dimensionality of the DataFrame. It returns (rows,columns) in the dataframe.

df.shape

(1000, 11)

duplicated() return boolean Series denoting duplicate rows. As the sum of all the elements in the series is 0, there are no duplicate rows.

sum(df.duplicated())

append() function is used to append rows of other dataframe to the end of the given dataframe, returning a new dataframe object. Below we are appending rows of df to the end of df itself, returning a new object. Hence we can see the number of rows have doubled.

df1 = df.append(df)
df1.shape

(2000, 11)

Now if you check the number of duplicated rows you can see that there are 1000 duplicated rows.

df1.duplicated().sum()

drop_duplicates() removes the duplicates rows and returns DataFrame.

df2 = df1.drop_duplicates()
df2.shape

(1000, 11)

df1.shape

(2000, 11)

Using inplace we can specify whether to drop duplicates in place or to return a copy.

df1.drop_duplicates(inplace = True)
df1.shape

(1000, 11)

Column Cleanup

df.columns gives the column labels of the DataFrame.

df.columns

Index(['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

len() gives the length. There are 11 columns in df.

len(df.columns)

To generate a descriptive statistics we can use describe(). Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

df.describe()

	Year	Runtime (Minutes)	Rating	Votes	Revenue (Millions)	Metascore
count	1000.000000	1000.000000	1000.000000	1.000000e+03	872.000000	936.000000
mean	2012.783000	113.172000	6.723200	1.698083e+05	82.956376	58.985043
std	3.205962	18.810908	0.945429	1.887626e+05	103.253540	17.194757
min	2006.000000	66.000000	1.900000	6.100000e+01	0.000000	11.000000
25%	2010.000000	100.000000	6.200000	3.630900e+04	13.270000	47.000000
50%	2014.000000	111.000000	6.800000	1.107990e+05	47.985000	59.500000
75%	2016.000000	123.000000	7.400000	2.399098e+05	113.715000	72.000000
max	2016.000000	191.000000	9.000000	1.791916e+06	936.630000	100.000000

Now we will make a list of the column names.

col = df.columns
type(list(col))

list

col

Index(['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

We can rename the columns by assigning a list of new names to df.columns.

col1 = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k']
df.columns = col1
df.head()

	a	b	c	d	e	f	g	h	i	j	k
Rank
1	Guardians of the Galaxy	Action,Adventure,Sci-Fi	A group of intergalactic criminals are forced …	James Gunn	Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S…	2014	121	8.1	757074	333.13	76.0
2	Prometheus	Adventure,Mystery,Sci-Fi	Following clues to the origin of mankind, a te…	Ridley Scott	Noomi Rapace, Logan Marshall-Green, Michael Fa…	2012	124	7.0	485820	126.46	65.0
3	Split	Horror,Thriller	Three girls are kidnapped by a man with a diag…	M. Night Shyamalan	James McAvoy, Anya Taylor-Joy, Haley Lu Richar…	2016	117	7.3	157606	138.12	62.0
4	Sing	Animation,Comedy,Family	In a city of humanoid animals, a hustling thea…	Christophe Lourdelet	Matthew McConaughey,Reese Witherspoon, Seth Ma…	2016	108	7.2	60545	270.32	59.0
5	Suicide Squad	Action,Adventure,Fantasy	A secret government agency recruits some of th…	David Ayer	Will Smith, Jared Leto, Margot Robbie, Viola D…	2016	123	6.2	393727	325.02	40.0

Now we will again change and use the original names.

df.columns = col
df.head(0)

	Title	Genre	Description	Director	Actors	Year	Runtime (Minutes)	Rating	Votes	Revenue (Millions)	Metascore
Rank

We can even rename the columns using the rename() function. We have passed a dictionary as the parameter which specifies the old name:new name as the key-value pair.

df.rename(columns={
    'Runtime (Minutes)': 'Runtime',
    'Revenue (Millions)': 'Revenue'
}, inplace= True)

df.columns

Index(['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime', 'Rating', 'Votes', 'Revenue', 'Metascore'],
      dtype='object')

You can compare the previous names displayed below and the new names displayed above.

col

Index(['Title', 'Genre', 'Description', 'Director', 'Actors', 'Year',
       'Runtime (Minutes)', 'Rating', 'Votes', 'Revenue (Millions)',
       'Metascore'],
      dtype='object')

Missing values

Now we will see how to handle missing values(nan). For that we will import numpy.

import numpy as np

nan is a numpy constant which represents Not a Number (NaN).

np.nan

nan

isnull() function detect missing values in the given series object. It return a boolean same-sized object indicating if the values are NA. Missing values gets mapped to True and non-missing value gets mapped to False.

df.isnull().head(10)

	Title	Genre	Description	Director	Actors	Year	Runtime	Rating	Votes	Revenue	Metascore
Rank
1	False	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False	False
5	False	False	False	False	False	False	False	False	False	False	False
6	False	False	False	False	False	False	False	False	False	False	False
7	False	False	False	False	False	False	False	False	False	False	False
8	False	False	False	False	False	False	False	False	False	True	False
9	False	False	False	False	False	False	False	False	False	False	False
10	False	False	False	False	False	False	False	False	False	False	False

To get a better picture we can sum the above result. Now we can see that Revenue has 128 null values and Metascore has 64 null values.

df.isnull().sum()

Title            0
Genre            0
Description      0
Director         0
Actors           0
Year             0
Runtime          0
Rating           0
Votes            0
Revenue        128
Metascore       64
dtype: int64

Alternatively, we can even use isna(). It returns a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, gets mapped to True values. Everything else gets mapped to False values.

df.isna().sum()

Title            0
Genre            0
Description      0
Director         0
Actors           0
Year             0
Runtime          0
Rating           0
Votes            0
Revenue        128
Metascore       64
dtype: int64

dropna() removes the rows which contain missing values. We can see that 162 rows are removed.

df1 = df.dropna()
df1.shape

(838, 11)

If we specify axis = 1, columns which contain missing values are dropped.

df2 = df.dropna(axis = 1)
df2.head(3)

	Title	Genre	Description	Director	Actors	Year	Runtime	Rating	Votes
Rank
1	Guardians of the Galaxy	Action,Adventure,Sci-Fi	A group of intergalactic criminals are forced …	James Gunn	Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S…	2014	121	8.1	757074
2	Prometheus	Adventure,Mystery,Sci-Fi	Following clues to the origin of mankind, a te…	Ridley Scott	Noomi Rapace, Logan Marshall-Green, Michael Fa…	2012	124	7.0	485820
3	Split	Horror,Thriller	Three girls are kidnapped by a man with a diag…	M. Night Shyamalan	James McAvoy, Anya Taylor-Joy, Haley Lu Richar…	2016	117	7.3	157606

Imputation

Imputation is the process of replacing missing data with substituted values.

fillna(0) fills NA/NaN values with 0. You can see that all the nan values are replaced by 0.

df3 = df.fillna(0)
df3.isna().sum()

Title          0
Genre          0
Description    0
Director       0
Actors         0
Year           0
Runtime        0
Rating         0
Votes          0
Revenue        0
Metascore      0
dtype: int64

df has 128 null values in Revenue and 64 null values in Metascore

df.isnull().sum()

Title            0
Genre            0
Description      0
Director         0
Actors           0
Year             0
Runtime          0
Rating           0
Votes            0
Revenue        128
Metascore       64
dtype: int64

revenue is a Series which contains the values of the column Revenue.

revenue = df['Revenue']
type(revenue)

pandas.core.series.Series

It has some NaN values.

revenue.tail()

Rank
996       NaN
997     17.54
998     58.01
999       NaN
1000    19.64
Name: Revenue, dtype: float64

mean() returns the mean of the values for the requested axis. Here we will get the mean of the values in revenue.

revenue_mean = revenue.mean()
revenue_mean

82.95637614678897

Now we will fill the missing values with the mean.

revenue.fillna(revenue_mean, inplace= True)
revenue.tail()

Rank
996     82.956376
997     17.540000
998     58.010000
999     82.956376
1000    19.640000
Name: Revenue, dtype: float64

We can see that now there are no null values.

revenue.isnull().sum()

We will modify df by adding the new revenue values. Now we can see that Revenue doesn’t have any null values but Metascore has null values.

df['Revenue'] = revenue
df.isnull().sum()

Title           0
Genre           0
Description     0
Director        0
Actors          0
Year            0
Runtime         0
Rating          0
Votes           0
Revenue         0
Metascore      64
dtype: int64

In the same way we will replace the null values in Metascore by the mean of the values in Metascore.

metascore = df['Metascore']
metascore_mean = metascore.mean()
print("metascore_mean =",metascore_mean)
metascore.fillna(metascore_mean, inplace = True)
df['Metascore'] = metascore
df.isnull().sum()

metascore_mean = 58.98504273504273

Title          0
Genre          0
Description    0
Director       0
Actors         0
Year           0
Runtime        0
Rating         0
Votes          0
Revenue        0
Metascore      0
dtype: int64

describe() gives statistical data of the numerical columns only.

df.describe()

	Year	Runtime	Rating	Votes	Revenue	Metascore
count	1000.000000	1000.000000	1000.000000	1.000000e+03	1000.000000	1000.000000
mean	2012.783000	113.172000	6.723200	1.698083e+05	82.956376	58.985043
std	3.205962	18.810908	0.945429	1.887626e+05	96.412043	16.634858
min	2006.000000	66.000000	1.900000	6.100000e+01	0.000000	11.000000
25%	2010.000000	100.000000	6.200000	3.630900e+04	17.442500	47.750000
50%	2014.000000	111.000000	6.800000	1.107990e+05	60.375000	58.985043
75%	2016.000000	123.000000	7.400000	2.399098e+05	99.177500	71.000000
max	2016.000000	191.000000	9.000000	1.791916e+06	936.630000	100.000000

info() gives information about all the columns.

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 1 to 1000
Data columns (total 11 columns):
Title          1000 non-null object
Genre          1000 non-null object
Description    1000 non-null object
Director       1000 non-null object
Actors         1000 non-null object
Year           1000 non-null int64
Runtime        1000 non-null int64
Rating         1000 non-null float64
Votes          1000 non-null int64
Revenue        1000 non-null float64
Metascore      1000 non-null float64
dtypes: float64(3), int64(3), object(5)
memory usage: 93.8+ KB

We can even get the describtion about a specific column.

df['Genre'].describe()

count                        1000
unique                        207
top       Action,Adventure,Sci-Fi
freq                           50
Name: Genre, dtype: object

value_counts() returns a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Below we have displayed the counts of the first 10 unique values in Genre column.

df['Genre'].value_counts().head(10)

Action,Adventure,Sci-Fi       50
Drama                         48
Comedy,Drama,Romance          35
Comedy                        32
Drama,Romance                 31
Comedy,Drama                  27
Action,Adventure,Fantasy      27
Animation,Adventure,Comedy    27
Comedy,Romance                26
Crime,Drama,Thriller          24
Name: Genre, dtype: int64

unique() returns all the unique values in order of appearance. len() is used to get the total number of unique values in Genre which is 207.

len(df['Genre'].unique())

Corr method

corr() is used to compute pairwise correlation of columns, excluding NA/null values. For any non-numeric data type columns in the dataframe it is ignored.

corrmat = df.corr()
corrmat

	Year	Runtime	Rating	Votes	Revenue	Metascore
Year	1.000000	-0.164900	-0.211219	-0.411904	-0.117562	-0.076077
Runtime	-0.164900	1.000000	0.392214	0.407062	0.247834	0.202239
Rating	-0.211219	0.392214	1.000000	0.511537	0.189527	0.604723
Votes	-0.411904	0.407062	0.511537	1.000000	0.607941	0.318116
Revenue	-0.117562	0.247834	0.189527	0.607941	1.000000	0.132304
Metascore	-0.076077	0.202239	0.604723	0.318116	0.132304	1.000000

Now we will visualize the correlation. For that we have imported seaborn library.

import seaborn as sns

heatmap() plots rectangular data as a color-encoded matrix. A heatmap is a two-dimensional graphical representation of data where the individual values that are contained in a matrix are represented as colors.

sns.heatmap(corrmat)

Now we will draw plots using matplotlib. For that we will import pyplot from matplotlib.

import matplotlib.pyplot as plt

plot() makes plots of Series or DataFrame. Here we are making a scatter plot of Rating versus Revenue

df.plot(kind = 'scatter', x = 'Rating', y = 'Revenue', title = 'Revenue vs Rating')

Now we will plot a histogram of Rating.

df['Rating'].plot(kind = 'hist', title = 'Rating')

Here we have drawn a Kernel Density Estimation plot of Rating.

df['Rating'].plot(kind = 'kde', title = 'Rating')

The above plots shows us the distribution of values in Rating. Most of the ratings are disrtibuted between 6 and 8.

df['Rating'].value_counts()

7.1    52
6.7    48
7.0    46
6.3    44
6.6    42
7.2    42
7.3    42
6.5    40
7.8    40
6.2    37
6.8    37
7.5    35
6.4    35
7.4    33
6.9    31
6.1    31
7.6    27
7.7    27
5.8    26
6.0    26
8.1    26
7.9    23
5.7    21
8.0    19
5.9    19
5.6    17
5.5    14
5.3    12
5.4    12
5.2    11
8.2    10
4.9     7
8.3     7
4.7     6
8.5     6
4.6     5
5.1     5
5.0     4
4.8     4
4.3     4
8.4     4
3.9     3
8.6     3
8.8     2
2.7     2
4.2     2
3.5     2
3.7     2
9.0     1
3.2     1
4.0     1
4.5     1
4.4     1
4.1     1
1.9     1
Name: Rating, dtype: int64

Now we will plot a box plot. Box plots show the five-number summary of a set of data: including the minimum, first (lower) quartile, median, third (upper) quartile, and maximum. Box plots divide the data into sections that each contain approximately 25% of the data in that set. The first quartile is the 25th percentile. Second quartile is 50th percentile and third quartile is 75th percentile.

The points outside the whiskers of the boxplot are called outliers. These “too far away” points are called “outliers”, because they “lie outside” the range in which we expect them.

df['Rating'].plot(kind = 'box')

You can compare the values below and the values shown by the boxplot.

df['Rating'].describe()

count    1000.000000
mean        6.723200
std         0.945429
min         1.900000
25%         6.200000
50%         6.800000
75%         7.400000
max         9.000000
Name: Rating, dtype: float64

Now we will add a new column Rating Category to df. If the Rating is more than 6.2 then Rating Category will be good or it will be bad.

rating_cat = []
for rate in df['Rating']:
    if rate > 6.2:
        rating_cat.append('Good')
    else:
        rating_cat.append('Bad')
        
rating_cat[:20]

['Good',
 'Good',
 'Good',
 'Good',
 'Bad',
 'Bad',
 'Good',
 'Good',
 'Good',
 'Good',
 'Good',
 'Good',
 'Good',
 'Good',
 'Good',
 'Good',
 'Good',
 'Good',
 'Good',
 'Good']

df['Rating Category'] = rating_cat
df.head()

	Title	Genre	Description	Director	Actors	Year	Runtime	Rating	Votes	Revenue	Metascore	Rating Category
Rank
1	Guardians of the Galaxy	Action,Adventure,Sci-Fi	A group of intergalactic criminals are forced …	James Gunn	Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S…	2014	121	8.1	757074	333.13	76.0	Good
2	Prometheus	Adventure,Mystery,Sci-Fi	Following clues to the origin of mankind, a te…	Ridley Scott	Noomi Rapace, Logan Marshall-Green, Michael Fa…	2012	124	7.0	485820	126.46	65.0	Good
3	Split	Horror,Thriller	Three girls are kidnapped by a man with a diag…	M. Night Shyamalan	James McAvoy, Anya Taylor-Joy, Haley Lu Richar…	2016	117	7.3	157606	138.12	62.0	Good
4	Sing	Animation,Comedy,Family	In a city of humanoid animals, a hustling thea…	Christophe Lourdelet	Matthew McConaughey,Reese Witherspoon, Seth Ma…	2016	108	7.2	60545	270.32	59.0	Good
5	Suicide Squad	Action,Adventure,Fantasy	A secret government agency recruits some of th…	David Ayer	Will Smith, Jared Leto, Margot Robbie, Viola D…	2016	123	6.2	393727	325.02	40.0	Bad

We can specify a column in the DataFrame to pandas.DataFrame.groupby() using by. One box-plot will be done per value of columns in by. Hence we have got 2 boxplots separately for good and bad.

df.boxplot(column= 'Revenue', by = 'Rating Category')

Pandas Crash Course

What is Pandas?

Dataset

Let’s start!

Reading CSV files

Column Cleanup

Missing values

Imputation

Corr method

0 Comments

Leave a Reply Cancel reply

Feature Engineering Tutorial Series 6: Variable magnitude

Feature Engineering Tutorial Series 5: Outliers

Feature Engineering Tutorial Series 4: Linear Model Assumptions

Pandas Crash Course

What is Pandas?

Dataset

Let’s start!

Reading CSV files

Column Cleanup

Missing values

Imputation

Corr method

0 Comments

Leave a Reply Cancel reply

Related Posts

Feature Engineering Tutorial Series 6: Variable magnitude

Feature Engineering Tutorial Series 5: Outliers

Feature Engineering Tutorial Series 4: Linear Model Assumptions