import pandas as pd
s = pd.Series([10, 20, 30, 40])
```
0    10
1    20
2    30
3    40
dtype: int64
```

Open the 04_data_import.ipynb file in the notebooks folder and import pandas and numpy. The raw data is stored in the data/raw folder as Archive.zip, which contains orders.csv and products.csv. There are 3 options to unzip the file:
Guess which one we will use in this course? Yes, you are right! We will use Python code to unzip the file. :)
As you can see, we are using the zipfile library to unzip the file. We open the zip file in read ('r') mode and then extract all the files to the specified location.
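The extraction pattern looks like the sketch below. The paths here are placeholders; in the course the zip would live at data/raw/Archive.zip. To keep the example runnable anywhere, we first build a tiny demo zip in a temporary folder:

```python
import os
import tempfile
import zipfile

# Build a small demo zip in a temp folder (stand-in for data/raw/Archive.zip)
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "Archive.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("orders.csv", "order_id,user_id\n1,1\n")

# The actual pattern: open in read ("r") mode and extract everything
with zipfile.ZipFile(zip_path, "r") as zf:
    zf.extractall(workdir)

print(os.path.exists(os.path.join(workdir, "orders.csv")))  # True
```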
If you are using macOS, after the extraction you might see another folder named __MACOSX. It contains only metadata and you should remove it.
CHALLENGE: Try to delete the __MACOSX folder if it exists.
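One way to solve the challenge is with shutil.rmtree. The path below is hypothetical; in the course it would be the extracted folder inside data/raw. Here we simulate the leftover folder in a temp directory so the sketch runs on its own:

```python
import os
import shutil
import tempfile

# Simulate the leftover metadata folder (stand-in for data/raw/__MACOSX)
folder = os.path.join(tempfile.mkdtemp(), "__MACOSX")
os.makedirs(folder)

# Remove the folder only if it exists
if os.path.exists(folder):
    shutil.rmtree(folder)

print(os.path.exists(folder))  # False
```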
Pay attention to the size of the orders.csv file: it is around 109 MB, so it cannot be pushed to GitHub (which limits files to 100 MB). We will use only a sample of the data for our analysis.
We can either:
Pandas has two primary data structures:
A Series consists of:
| | customer | revenue |
|---|---|---|
| 0 | Anna | 100 |
| 1 | David | 250 |
| 2 | Liza | 180 |
A DataFrame consists of:
In mathematical form, a DataFrame can be represented as a matrix:
\[ X = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix} \]
where each row corresponds to an observation, each column to a variable, and \(x_{ij}\) is the value of variable \(j\) for observation \(i\).
Now that we have unzipped the files, we can import them into our Jupyter Notebook using the pandas library.
We will learn how to import other data formats in later sessions. For now, we will focus on importing CSV files, the most common format for storing tabular data.
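Importing a CSV is a single pd.read_csv call. In the notebook you would read from the file path (e.g. pd.read_csv("../data/raw/orders.csv")); the sketch below reads the same structure from an in-memory string so it is self-contained:

```python
import io
import pandas as pd

# Stand-in for "../data/raw/orders.csv"; the values are illustrative
csv_text = "order_id,user_id,eval_set\n2539329,1,prior\n2398795,1,prior\n"
df_orders = pd.read_csv(io.StringIO(csv_text))
print(df_orders.head())
```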
orders.csv file:
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
Try to make a habit of using the head() method after importing data, to check that it was imported correctly and to get a quick overview of the data.
products.csv file:
| | product_id | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 |
| 1 | 2 | All-Seasons Salt | 104 | 13 | 9.3 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 | 4.5 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 | 10.5 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 | 4.3 |
The head() method returns the top N rows of the data. The default value of N is 5, so if you do not specify the number of rows to display, it will show you the top 5 rows of the DataFrame.
| | product_id | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 |
| 1 | 2 | All-Seasons Salt | 104 | 13 | 9.3 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 | 4.5 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 | 10.5 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 | 4.3 |
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
You can specify the number of rows to display by passing an integer to the head() method. For example, df.head(10) will display the top 10 rows of the DataFrame.
Similarly, the tail() method returns the bottom N rows of the data.
| | product_id | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|
| 49688 | 49684 | Vodka, Triple Distilled, Twist of Vanilla | 124 | 5 | 5.3 |
| 49689 | 49685 | En Croute Roast Hazelnut Cranberry | 42 | 1 | 3.1 |
| 49690 | 49686 | Artisan Baguette | 112 | 3 | 7.8 |
| 49691 | 49687 | Smartblend Healthy Metabolism Dry Cat Food | 41 | 8 | 4.7 |
| 49692 | 49688 | Fresh Foaming Cleanser | 73 | 11 | 13.5 |
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 3421078 | 2266710 | 206209 | prior | 10 | 5 | 18 | 29.0 |
| 3421079 | 1854736 | 206209 | prior | 11 | 4 | 10 | 30.0 |
| 3421080 | 626363 | 206209 | prior | 12 | 1 | 12 | 18.0 |
| 3421081 | 2977660 | 206209 | prior | 13 | 1 | 12 | 7.0 |
| 3421082 | 272231 | 206209 | train | 14 | 6 | 14 | 30.0 |
From this simple overview of the data, we can see the structure and column names of each DataFrame.
Although we can get the column names using list(df.columns), it is more common to use df.columns to get the column names as an Index object.
```
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='str')
```
Another handy function is df.info(). It prints basic information about your dataframe: how many rows and columns it has, what the columns are called, and what data types the columns contain.
```
<class 'pandas.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
 #   Column                  Dtype
---  ------                  -----
 0   order_id                int64
 1   user_id                 int64
 2   eval_set                str
 3   order_number            int64
 4   order_dow               int64
 5   order_hour_of_day       int64
 6   days_since_prior_order  float64
dtypes: float64(1), int64(5), str(1)
memory usage: 198.9 MB
```
There’s also a dedicated way to check just the data types of a dataframe’s columns. If you only want the data types (and nothing else), the df.dtypes attribute gives a cleaner output:
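A minimal sketch using a small stand-in DataFrame (the values are illustrative):

```python
import pandas as pd

# Small stand-in for the orders DataFrame
df = pd.DataFrame({"order_id": [1, 2], "eval_set": ["prior", "train"]})

# .dtypes is an attribute, not a method: no parentheses needed
print(df.dtypes)
```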
With just one function, you can automatically generate all those descriptive statistics you should be well familiar with by this point:
| | order_id | user_id | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|
| count | 3.421083e+06 | 3.421083e+06 | 3.421083e+06 | 3.421083e+06 | 3.421083e+06 | 3.214874e+06 |
| mean | 1.710542e+06 | 1.029782e+05 | 1.715486e+01 | 2.776219e+00 | 1.345202e+01 | 1.111484e+01 |
| std | 9.875817e+05 | 5.953372e+04 | 1.773316e+01 | 2.046829e+00 | 4.226088e+00 | 9.206737e+00 |
| min | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 8.552715e+05 | 5.139400e+04 | 5.000000e+00 | 1.000000e+00 | 1.000000e+01 | 4.000000e+00 |
| 50% | 1.710542e+06 | 1.026890e+05 | 1.100000e+01 | 3.000000e+00 | 1.300000e+01 | 7.000000e+00 |
| 75% | 2.565812e+06 | 1.543850e+05 | 2.300000e+01 | 5.000000e+00 | 1.600000e+01 | 1.500000e+01 |
| max | 3.421083e+06 | 2.062090e+05 | 1.000000e+02 | 6.000000e+00 | 2.300000e+01 | 3.000000e+01 |
Imagine producing that summary by hand in Excel, or writing SQL queries to get the same information! The term “data wrangling” is just a fancy term for conducting manipulations or transformations on data. When you’re working with large sets of data, it can be difficult to see all the data entries and values in your dataframe. This is where Python (and pandas!) comes in.
The function itself is df.drop(), and within its parentheses comes the argument, which, in this case, is a list. The list should contain the individual columns that you want to drop, enclosed in quotation marks. To remove the eval_set column from your orders.csv dataframe, the full function would be:
Create a new DataFrame without the eval_set column
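A minimal sketch of the drop() call. A small stand-in for df_orders is built inline (the values are illustrative):

```python
import pandas as pd

# Small stand-in for df_orders
df_orders = pd.DataFrame({
    "order_id": [2539329, 2398795],
    "user_id": [1, 1],
    "eval_set": ["prior", "prior"],
    "order_number": [1, 2],
})

# drop() returns a new DataFrame; df_orders itself is unchanged
df_orders_cleaned = df_orders.drop(columns=["eval_set"])
print(df_orders_cleaned.columns.tolist())  # ['order_id', 'user_id', 'order_number']
```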
| | order_id | user_id | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 3421078 | 2266710 | 206209 | 10 | 5 | 18 | 29.0 |
| 3421079 | 1854736 | 206209 | 11 | 4 | 10 | 30.0 |
| 3421080 | 626363 | 206209 | 12 | 1 | 12 | 18.0 |
| 3421081 | 2977660 | 206209 | 13 | 1 | 12 | 7.0 |
| 3421082 | 272231 | 206209 | 14 | 6 | 14 | 30.0 |
3421083 rows × 6 columns
```
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='str')
```
Note that this does not change the original DataFrame df_orders. If you want to change the original DataFrame, you can either assign the result back to it or use the inplace=True parameter.
The function itself is df.rename(), and within its parentheses comes the argument columns, which expects a dictionary. The dictionary should map old column names to new column names. To rename the eval_set column to dataset_type in your orders.csv dataframe, the full function would be:
Create a new DataFrame with the renamed column
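A minimal sketch of the rename() call, again with a small stand-in DataFrame:

```python
import pandas as pd

# Small stand-in for df_orders
df_orders = pd.DataFrame({"order_id": [2539329], "eval_set": ["prior"]})

# rename() takes a dict mapping old column names to new ones
df_orders_renamed = df_orders.rename(columns={"eval_set": "dataset_type"})
print(df_orders_renamed.columns.tolist())  # ['order_id', 'dataset_type']
```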
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3421078 | 2266710 | 206209 | prior | 10 | 5 | 18 | 29.0 |
| 3421079 | 1854736 | 206209 | prior | 11 | 4 | 10 | 30.0 |
| 3421080 | 626363 | 206209 | prior | 12 | 1 | 12 | 18.0 |
| 3421081 | 2977660 | 206209 | prior | 13 | 1 | 12 | 7.0 |
| 3421082 | 272231 | 206209 | train | 14 | 6 | 14 | 30.0 |
3421083 rows × 7 columns
```
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='str')
```
Again, this does not change the original DataFrame df_orders. If you want to change the original DataFrame, you can either assign the result back to it or use the inplace=True parameter.
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
The function itself is df.astype(), and within its parentheses comes a dictionary that specifies the column name and the target data type. To change the order_id column to integer type in your orders.csv dataframe, the full function would be:
Create a new DataFrame with updated data types
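A minimal sketch of the astype() call. Here the stand-in column starts as strings so the conversion to integer is visible:

```python
import pandas as pd

# Stand-in data: order_id read as strings (illustrative)
df_orders = pd.DataFrame({"order_id": ["2539329", "2398795"]})

# astype() takes a dict of {column: target dtype}
df_orders_typed = df_orders.astype({"order_id": "int64"})
print(df_orders_typed.dtypes)
```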
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3421078 | 2266710 | 206209 | prior | 10 | 5 | 18 | 29.0 |
| 3421079 | 1854736 | 206209 | prior | 11 | 4 | 10 | 30.0 |
| 3421080 | 626363 | 206209 | prior | 12 | 1 | 12 | 18.0 |
| 3421081 | 2977660 | 206209 | prior | 13 | 1 | 12 | 7.0 |
| 3421082 | 272231 | 206209 | train | 14 | 6 | 14 | 30.0 |
3421083 rows × 7 columns
```
order_id                    int64
user_id                     int64
dataset_type                  str
order_number                int64
order_dow                   int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object
```
Again, this does not change the original DataFrame df_orders. To change the original, either assign the result back to it or overwrite the column directly.
You can create new columns or transform existing ones using vectorized operations or the apply() function. Using the columns of our orders DataFrame (order_id, user_id, dataset_type, order_number, order_dow, order_hour_of_day, days_since_prior_order), we will now create meaningful analytical features.
Create a new DataFrame with transformed data
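A sketch of the two boolean features using vectorized operations. The exact definitions are assumptions (weekend = day 0 or 6, morning = before 12:00); only the morning cutoff is visible in the output below:

```python
import pandas as pd

# Small stand-in for the orders DataFrame
df_orders = pd.DataFrame({
    "order_dow": [2, 3, 6],
    "order_hour_of_day": [8, 12, 7],
})

# Assumed definitions: weekend days are 0 and 6; morning is before noon
df_orders["is_weekend"] = df_orders["order_dow"].isin([0, 6])
df_orders["is_morning"] = df_orders["order_hour_of_day"] < 12
print(df_orders)
```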
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order | is_weekend | is_morning |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN | False | True |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 | False | True |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 | False | False |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 | False | True |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 | False | False |
Example: Creating an order frequency category based on order_number
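One way to sketch this with apply(). Only "New" (order_number 1) and "Low" (2 to 5) appear in the output below; the remaining boundaries and labels are hypothetical:

```python
import pandas as pd

df_orders = pd.DataFrame({"order_number": [1, 4, 15, 60]})

# Hypothetical category boundaries (only "New" and "Low" are confirmed above)
def frequency_category(n):
    if n == 1:
        return "New"
    elif n <= 5:
        return "Low"
    elif n <= 20:
        return "Medium"
    return "High"

df_orders["order_frequency_category"] = df_orders["order_number"].apply(frequency_category)
print(df_orders)
```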
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order | is_weekend | is_morning | order_frequency_category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN | False | True | New |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 | False | True | Low |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 | False | False | Low |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 | False | True | Low |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 | False | False | Low |
Example: Handling missing values in days_since_prior_order
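Filling the missing values with fillna(). A first order has no prior order, so NaN becomes 0, as in the output below:

```python
import numpy as np
import pandas as pd

# Stand-in column with a missing first value
df_orders = pd.DataFrame({"days_since_prior_order": [np.nan, 15.0, 21.0]})

# Replace NaN (first order per user) with 0
df_orders["days_since_prior_order"] = df_orders["days_since_prior_order"].fillna(0)
print(df_orders)
```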
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order | is_weekend | is_morning | order_frequency_category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | 0.0 | False | True | New |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 | False | True | Low |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 | False | False | Low |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 | False | True | Low |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 | False | False | Low |
Example: Creating time-of-day buckets
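One way to build the buckets is pd.cut. The bucket boundaries here are assumptions; only Morning (before 12) and Afternoon (12 and later) are visible in the output below:

```python
import pandas as pd

df_orders = pd.DataFrame({"order_hour_of_day": [3, 8, 12, 20]})

# Assumed boundaries: Night 0-5, Morning 6-11, Afternoon 12-17, Evening 18-23
df_orders["time_bucket"] = pd.cut(
    df_orders["order_hour_of_day"],
    bins=[-1, 5, 11, 17, 23],
    labels=["Night", "Morning", "Afternoon", "Evening"],
)
print(df_orders)
```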
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order | is_weekend | is_morning | order_frequency_category | time_bucket |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | 0.0 | False | True | New | Morning |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 | False | True | Low | Morning |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 | False | False | Low | Afternoon |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 | False | True | Low | Morning |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 | False | False | Low | Afternoon |
Vectorized operations (like comparisons and arithmetic directly on columns) are preferred over apply() whenever possible because they are faster and more memory-efficient.
Transposing refers to turning your dataframe’s rows into columns, and vice versa. (This is related to, but not the same as, reshaping data from wide format to long format.) Let’s take a look at how to do this in Python, as it often involves more steps than you’d anticipate!
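The transpose itself is just the .T attribute. A tiny stand-in for the wide departments DataFrame (the real one has 22 columns):

```python
import pandas as pd

# Stand-in for the 1-row, very wide departments DataFrame
df_departments = pd.DataFrame({
    "department_id": ["department"],
    "1": ["frozen"],
    "2": ["other"],
})

# .T swaps rows and columns: 1 row x 3 columns becomes 3 rows x 1 column
df_t = df_departments.T
print(df_t.shape)  # (3, 1)
```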
| | department_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | department | frozen | other | bakery | produce | alcohol | international | beverages | pets | dry goods pasta | ... | meat seafood | pantry | breakfast | canned goods | dairy eggs | household | babies | snacks | deli | missing |
1 rows × 22 columns
This is a strange-looking dataframe, isn’t it? There’s only 1 row and 22 columns, making the dataframe incredibly wide and incredibly short:
| | 0 |
|---|---|
| department_id | department |
| 1 | frozen |
| 2 | other |
| 3 | bakery |
| 4 | produce |
| 5 | alcohol |
| 6 | international |
| 7 | beverages |
| 8 | pets |
| 9 | dry goods pasta |
| 10 | bulk |
| 11 | personal care |
| 12 | meat seafood |
| 13 | pantry |
| 14 | breakfast |
| 15 | canned goods |
| 16 | dairy eggs |
| 17 | household |
| 18 | babies |
| 19 | snacks |
| 20 | deli |
| 21 | missing |
Now let’s add the column names back to the transposed dataframe:
Setting the new header:
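A common pattern for this (a sketch with a small stand-in for the transposed DataFrame): take the first row as the header, drop that row, and assign it as the column names:

```python
import pandas as pd

# Stand-in for the transposed departments DataFrame: the first row holds
# the value we want as the column header
df_t = pd.DataFrame({0: ["department", "frozen", "other", "bakery"]})

new_header = df_t.iloc[0]   # grab the first row for the header
df_t = df_t[1:]             # keep the data minus the header row
df_t.columns = new_header   # set the header row as the column names
print(df_t.columns.tolist())  # ['department']
```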
After cleaning and transforming data, we often export it.
When exporting data, set index=False to avoid writing the row indices into the output file, unless the existing index is meaningful for your analysis.
Try without index=False and see what happens. You will get an extra column with the row indices, which is not needed in most cases.
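A minimal sketch of the export. It writes to an in-memory buffer so it runs anywhere; in the course you would pass a file path (for example somewhere under the data/ folder) instead:

```python
import io
import pandas as pd

df = pd.DataFrame({"product_id": [1, 2], "prices": [5.8, 9.3]})

# index=False keeps the row indices out of the CSV
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
```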
A file path tells Python where to find a file or folder.
Absolute Path: A complete path from the root directory
Example:
C:/Users/YourName/Project/data/file.xlsx
Relative Path: A path relative to the current working directory
Example:
data/file.xlsx
os.path
main.ipynb is inside the notebooks/ folder, so it needs to go up one level (..) to access the data/ folder.
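Building such a relative path with os.path.join keeps it portable across operating systems:

```python
import os

# From a notebook inside notebooks/, go up one level (..) and into data/
path = os.path.join("..", "data", "file.xlsx")
print(path)
```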
Here you can find practically all the pandas functionality you will need for Data Analytics.