

Session 03: Introduction to Pandas


Environment Setup

  1. Create a 04_data_import.ipynb file in the notebooks folder.
  2. Import the necessary libraries: pandas and numpy.
  3. Download the dataset from this link and save it in the data/raw folder as Archive.zip.
  4. After unzipping the file, you will get two CSV files:
    1. orders.csv
    2. products.csv

Unzipping the files

There are three options for unzipping the file:

  1. using the terminal
  2. using the file explorer
  3. using Python code

Guess which one we will use in this course? Yes, you are right! We will use Python code to unzip the file. :)

import zipfile
with zipfile.ZipFile("../data/raw/Archive.zip", "r") as zip_ref:
    zip_ref.extractall("../data/raw")

As you can see, we are using the zipfile library to unzip the file. We open the zip file in read mode ("r") and then extract all the files to the specified location.

Warning: __MACOSX

If you are using macOS, after the extraction you might see an extra folder named __MACOSX. It contains only metadata, and you should remove it.

CHALLENGE: Try to delete the __MACOSX folder if it exists.
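One possible way to solve the challenge is a short sketch like the one below; the path assumes the extraction location used in the unzip code above.

```python
import shutil
from pathlib import Path

# Path assumed from the unzip step above
macosx_dir = Path("../data/raw/__MACOSX")

# shutil.rmtree deletes the folder and everything inside it,
# so guard the call with an existence check
if macosx_dir.exists():
    shutil.rmtree(macosx_dir)
```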


Important

Pay attention to the size of the orders.csv file. It is around 109 MB, so it will not be possible to push it to GitHub (which rejects files larger than 100 MB). We will use only a sample of the data for our analysis.

We can either:

  1. Create a sample of the data using Python code and save it as a new CSV file.
  2. Push the zip file to GitHub, unzip it locally on our machines (just like we did in the code above), and then create and save a sample as in option 1.
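Option 1 might look like the following sketch. A toy frame stands in for the full orders.csv here, and the 10% fraction and random_state value are illustrative choices, not fixed by the course.

```python
import pandas as pd

# Toy frame standing in for the full orders.csv
df_orders = pd.DataFrame({"order_id": range(1, 101), "user_id": [1] * 100})

# Keep a reproducible 10% sample
df_sample = df_orders.sample(frac=0.1, random_state=42)

# Then save it as a new CSV (path follows the course's folder layout):
# df_sample.to_csv("../data/raw/orders_sample.csv", index=False)
```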

Core Data Structures

Pandas has two primary data structures:

  • Series: 1-dimensional labeled array
  • DataFrame: 2-dimensional labeled table

Series Example

import pandas as pd

s = pd.Series([10, 20, 30, 40])
s
0    10
1    20
2    30
3    40
dtype: int64

A Series consists of:

  • Values
  • Index
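Both pieces are directly accessible on the Series object:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

print(s.tolist())      # the values: [10, 20, 30, 40]
print(list(s.index))   # the (default RangeIndex) labels: [0, 1, 2, 3]
```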

DataFrame Example

data = {
    "customer": ["Anna", "David", "Liza"],
    "revenue": [100, 250, 180]
}

df = pd.DataFrame(data)
df
customer revenue
0 Anna 100
1 David 250
2 Liza 180

A DataFrame consists of:

  • Rows (index)
  • Columns
  • Values

In mathematical form, a DataFrame can be represented as a matrix:

\[ X = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix} \]

Where:

  • \(n\) = number of observations (rows)
  • \(p\) = number of variables (columns)
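In pandas, these two dimensions are exposed by the shape attribute; a quick check on the customer/revenue frame from the DataFrame example above:

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Anna", "David", "Liza"],
    "revenue": [100, 250, 180],
})

n, p = df.shape  # (number of rows, number of columns)
print(n, p)      # 3 observations, 2 variables
```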

Importing the data

Now that we have unzipped the files, we can import them into our Jupyter Notebook using the pandas library.

Tip: Why CSV files?

We will learn to import other data formats in later sessions. For now, we will focus on importing CSV files, the most common format for storing tabular data.

Importing orders.csv file

import pandas as pd

df_orders = pd.read_csv("../data/raw/orders.csv")
df_orders.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0
Important

Try to make a habit of calling the head() method after importing data, to check that the data was imported correctly and to get a quick overview of it.

Importing products.csv file

df_products = pd.read_csv("../data/raw/products.csv")
df_products.head()
product_id product_name aisle_id department_id prices
0 1 Chocolate Sandwich Cookies 61 19 5.8
1 2 All-Seasons Salt 104 13 9.3
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.5
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.5
4 5 Green Chile Anytime Sauce 5 13 4.3

Common parameters

pd.read_csv(
    "data/sales.csv",            # path to the file
    sep=",",                     # column separator (comma is the default)
    header=0,                    # row number to use as the column names
    na_values=["NA", ""],        # extra strings to treat as missing values
    parse_dates=["order_date"]   # parse these columns as dates
)

Exploring Data with Pandas

Top N rows of the data

Default value of N is 5, so if you do not specify the number of rows to display, it will show you the top 5 rows of the DataFrame.

df_products.head()
product_id product_name aisle_id department_id prices
0 1 Chocolate Sandwich Cookies 61 19 5.8
1 2 All-Seasons Salt 104 13 9.3
2 3 Robust Golden Unsweetened Oolong Tea 94 7 4.5
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1 10.5
4 5 Green Chile Anytime Sauce 5 13 4.3
df_orders.head()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0
Note

You can specify the number of rows to display by passing an integer to the head() method. For example, df.head(10) will display the top 10 rows of the DataFrame.

Bottom N rows of the data

df_products.tail()
product_id product_name aisle_id department_id prices
49688 49684 Vodka, Triple Distilled, Twist of Vanilla 124 5 5.3
49689 49685 En Croute Roast Hazelnut Cranberry 42 1 3.1
49690 49686 Artisan Baguette 112 3 7.8
49691 49687 Smartblend Healthy Metabolism Dry Cat Food 41 8 4.7
49692 49688 Fresh Foaming Cleanser 73 11 13.5
df_orders.tail()
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order
3421078 2266710 206209 prior 10 5 18 29.0
3421079 1854736 206209 prior 11 4 10 30.0
3421080 626363 206209 prior 12 1 12 18.0
3421081 2977660 206209 prior 13 1 12 7.0
3421082 272231 206209 train 14 6 14 30.0

From this simple overview of the data, we can see the structure and column names of each DataFrame.

Column names of the DataFrame

  • Orders: ['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow', 'order_hour_of_day', 'days_since_prior_order']
  • Products: ['product_id', 'product_name', 'aisle_id', 'department_id', 'prices']

Get Columns

Although we can get the column names using list(df.columns), it is more common to use df.columns to get the column names as an Index object.

df_orders.columns
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='str')
df_products.columns
Index(['product_id', 'product_name', 'aisle_id', 'department_id', 'prices'], dtype='str')

Data types of the columns

Another handy function is df.info(). It prints basic information about your dataframe: how many rows and columns it has, what the columns are called, what data types they contain, and how much memory the dataframe uses.

df_orders.info()
<class 'pandas.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
 #   Column                  Dtype  
---  ------                  -----  
 0   order_id                int64  
 1   user_id                 int64  
 2   eval_set                str    
 3   order_number            int64  
 4   order_dow               int64  
 5   order_hour_of_day       int64  
 6   days_since_prior_order  float64
dtypes: float64(1), int64(5), str(1)
memory usage: 198.9 MB

There’s also a dedicated attribute just for checking the data types of a dataframe’s columns. If you only want to check the data types (and nothing else), you can use df.dtypes for a cleaner output:

df_orders.dtypes

Summary statistics of the data

With just one function, you can automatically generate all those descriptive statistics you should be well familiar with by this point:

  • count
  • mean
  • standard deviation
  • minimum
  • 25th, 50th (median), and 75th percentiles
  • maximum
df_orders.describe()
order_id user_id order_number order_dow order_hour_of_day days_since_prior_order
count 3.421083e+06 3.421083e+06 3.421083e+06 3.421083e+06 3.421083e+06 3.214874e+06
mean 1.710542e+06 1.029782e+05 1.715486e+01 2.776219e+00 1.345202e+01 1.111484e+01
std 9.875817e+05 5.953372e+04 1.773316e+01 2.046829e+00 4.226088e+00 9.206737e+00
min 1.000000e+00 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
25% 8.552715e+05 5.139400e+04 5.000000e+00 1.000000e+00 1.000000e+01 4.000000e+00
50% 1.710542e+06 1.026890e+05 1.100000e+01 3.000000e+00 1.300000e+01 7.000000e+00
75% 2.565812e+06 1.543850e+05 2.300000e+01 5.000000e+00 1.600000e+01 1.500000e+01
max 3.421083e+06 2.062090e+05 1.000000e+02 6.000000e+00 2.300000e+01 3.000000e+01
Tip
  • Compare this method with typing and copying formulas in Excel!
  • Compare this method with writing SQL queries to get the same information.
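Note that describe() summarizes numeric columns by default. Passing include="all" (or include="object") adds the text columns too, reporting count, unique, top, and freq for them. A sketch on a toy frame standing in for df_orders:

```python
import pandas as pd

# Toy frame standing in for df_orders
df = pd.DataFrame({
    "order_number": [1, 2, 3, 4],
    "eval_set": ["prior", "prior", "prior", "train"],
})

# include="all" summarizes numeric and text columns together
summary = df.describe(include="all")
print(summary.loc["unique", "eval_set"])  # 2 distinct values in eval_set
```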

Data Wrangling and Subsetting

The term “data wrangling” is just a fancy term for conducting manipulations or transformations on data. When you’re working with large sets of data, it can be difficult to see all the data entries and values in your dataframe. This is where Python (and pandas!) comes in.

Data Wrangling Procedures

  • Dropping columns
  • Renaming columns
  • Changing data types
  • Transforming data (e.g., creating new columns, applying functions to columns)

Dropping columns

The function itself is df.drop(), and its columns argument expects a list of the column names you want to drop, each enclosed in quotation marks. To remove the eval_set column from your orders dataframe, the full call would be:

Create a new DataFrame without the eval_set column

Option 1:

df_orders_new = df_orders.drop(columns=["eval_set"])

Option 2:

df_orders.drop(columns = ['eval_set'])
order_id user_id order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 1 2 8 NaN
1 2398795 1 2 3 7 15.0
2 473747 1 3 3 12 21.0
3 2254736 1 4 4 7 29.0
4 431534 1 5 4 15 28.0
... ... ... ... ... ... ...
3421078 2266710 206209 10 5 18 29.0
3421079 1854736 206209 11 4 10 30.0
3421080 626363 206209 12 1 12 18.0
3421081 2977660 206209 13 1 12 7.0
3421082 272231 206209 14 6 14 30.0

3421083 rows × 6 columns

df_orders.columns
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='str')
Important

By doing this, you will not change the original DataFrame df_orders. If you want to change the original DataFrame, either assign the result back to it or use the inplace=True parameter.

df_orders.drop(columns = ['eval_set'], inplace=True)

df_orders.head()

Renaming columns

The function itself is df.rename(), and within its parentheses comes the argument columns, which expects a dictionary. The dictionary should map old column names to new column names. To rename the eval_set column to dataset_type in your orders.csv dataframe, the full function would be:

Create a new DataFrame with the renamed column

Option 1:

df_orders_new = df_orders.rename(columns={"eval_set": "dataset_type"})
df_orders_new.head()
order_id user_id dataset_type order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0

Option 2:

df_orders.rename(columns = {'eval_set': 'dataset_type'})
order_id user_id dataset_type order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0
... ... ... ... ... ... ... ...
3421078 2266710 206209 prior 10 5 18 29.0
3421079 1854736 206209 prior 11 4 10 30.0
3421080 626363 206209 prior 12 1 12 18.0
3421081 2977660 206209 prior 13 1 12 7.0
3421082 272231 206209 train 14 6 14 30.0

3421083 rows × 7 columns

df_orders.columns
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='str')
Important

By doing this, it will not change the original DataFrame df_orders. If you want to change the original DataFrame, you can either assign the result back to the original DataFrame or use the inplace=True parameter.

df_orders.rename(columns = {'eval_set': 'dataset_type'}, inplace=True)

df_orders.head()
order_id user_id dataset_type order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0

Changing data types

The function itself is df.astype(), and within its parentheses comes a dictionary that specifies the column name and the target data type. To change the order_id column to integer type in your orders.csv dataframe, the full function would be:

Create a new DataFrame with updated data types

Option 1:

df_orders_new = df_orders.astype({"order_id": "int64"})

Option 2:

df_orders.astype({'order_id': 'int64'})
order_id user_id dataset_type order_number order_dow order_hour_of_day days_since_prior_order
0 2539329 1 prior 1 2 8 NaN
1 2398795 1 prior 2 3 7 15.0
2 473747 1 prior 3 3 12 21.0
3 2254736 1 prior 4 4 7 29.0
4 431534 1 prior 5 4 15 28.0
... ... ... ... ... ... ... ...
3421078 2266710 206209 prior 10 5 18 29.0
3421079 1854736 206209 prior 11 4 10 30.0
3421080 626363 206209 prior 12 1 12 18.0
3421081 2977660 206209 prior 13 1 12 7.0
3421082 272231 206209 train 14 6 14 30.0

3421083 rows × 7 columns

df_orders.dtypes
order_id                    int64
user_id                     int64
dataset_type                  str
order_number                int64
order_dow                   int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object
Important

By doing this, it will not change the original DataFrame df_orders. If you want to change the original DataFrame, you can either assign the result back to the original DataFrame or overwrite the column directly.

df_orders['order_id'] = df_orders['order_id'].astype('int64')

df_orders.dtypes
order_id                    int64
user_id                     int64
dataset_type                  str
order_number                int64
order_dow                   int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object

Transforming data (e.g., creating new columns, applying functions to columns)

You can create new columns or transform existing ones using vectorized operations or the apply() function. Using the columns from your orders DataFrame (order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order), we will now create meaningful analytical features.


Create a new DataFrame with transformed data

Option 1:

df_orders_new = df_orders.assign(
    is_weekend = df_orders["order_dow"].isin([0, 6]),
    is_morning = df_orders["order_hour_of_day"] < 12
)

Option 2:

df_orders["is_weekend"] = df_orders["order_dow"].isin([0, 6])
df_orders["is_morning"] = df_orders["order_hour_of_day"] < 12

df_orders.head()
order_id user_id dataset_type order_number order_dow order_hour_of_day days_since_prior_order is_weekend is_morning
0 2539329 1 prior 1 2 8 NaN False True
1 2398795 1 prior 2 3 7 15.0 False True
2 473747 1 prior 3 3 12 21.0 False False
3 2254736 1 prior 4 4 7 29.0 False True
4 431534 1 prior 5 4 15 28.0 False False

Example: Creating an order frequency category based on order_number

df_orders["order_frequency_category"] = df_orders["order_number"].apply(
    lambda x: "New" if x == 1 
    else "Low" if x <= 5 
    else "High"
)

df_orders.head()
order_id user_id dataset_type order_number order_dow order_hour_of_day days_since_prior_order is_weekend is_morning order_frequency_category
0 2539329 1 prior 1 2 8 NaN False True New
1 2398795 1 prior 2 3 7 15.0 False True Low
2 473747 1 prior 3 3 12 21.0 False False Low
3 2254736 1 prior 4 4 7 29.0 False True Low
4 431534 1 prior 5 4 15 28.0 False False Low

Example: Handling missing values in days_since_prior_order

df_orders["days_since_prior_order"] = df_orders["days_since_prior_order"].fillna(0)

df_orders.head()
order_id user_id dataset_type order_number order_dow order_hour_of_day days_since_prior_order is_weekend is_morning order_frequency_category
0 2539329 1 prior 1 2 8 0.0 False True New
1 2398795 1 prior 2 3 7 15.0 False True Low
2 473747 1 prior 3 3 12 21.0 False False Low
3 2254736 1 prior 4 4 7 29.0 False True Low
4 431534 1 prior 5 4 15 28.0 False False Low

Example: Creating time-of-day buckets

df_orders["time_bucket"] = df_orders["order_hour_of_day"].apply(
    lambda x: "Morning" if 6 <= x < 12
    else "Afternoon" if 12 <= x < 18
    else "Evening"
)

df_orders.head()
order_id user_id dataset_type order_number order_dow order_hour_of_day days_since_prior_order is_weekend is_morning order_frequency_category time_bucket
0 2539329 1 prior 1 2 8 0.0 False True New Morning
1 2398795 1 prior 2 3 7 15.0 False True Low Morning
2 473747 1 prior 3 3 12 21.0 False False Low Afternoon
3 2254736 1 prior 4 4 7 29.0 False True Low Morning
4 431534 1 prior 5 4 15 28.0 False False Low Afternoon
Important

Vectorized operations (like comparisons and arithmetic directly on columns) are preferred over apply() whenever possible because they are faster and more memory-efficient.
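For instance, the time_bucket lambda above can be rewritten without apply() using numpy's np.select; a sketch on a toy column (the hour values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy column of order hours (illustrative values)
df = pd.DataFrame({"order_hour_of_day": [8, 12, 2, 19]})

hours = df["order_hour_of_day"]
conditions = [
    (hours >= 6) & (hours < 12),    # Morning
    (hours >= 12) & (hours < 18),   # Afternoon
]
# Rows matching no condition fall through to the default
df["time_bucket"] = np.select(conditions, ["Morning", "Afternoon"], default="Evening")

print(df["time_bucket"].tolist())  # ['Morning', 'Afternoon', 'Evening', 'Evening']
```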

Transposing a DataFrame

Transposing refers to turning your dataframe’s rows into columns, and vice versa. Here it converts the data from an extremely wide format into a long format. Let’s take a look at how to do this in Python, as it often involves more steps than you’d anticipate!

df_departments = pd.read_csv("../data/raw/departments.csv")
df_departments.head()
department_id 1 2 3 4 5 6 7 8 9 ... 12 13 14 15 16 17 18 19 20 21
0 department frozen other bakery produce alcohol international beverages pets dry goods pasta ... meat seafood pantry breakfast canned goods dairy eggs household babies snacks deli missing

1 rows × 22 columns

This is a strange-looking dataframe, isn’t it? There’s only 1 row and 22 columns, making the dataframe incredibly wide and incredibly short:

df_departments.T
0
department_id department
1 frozen
2 other
3 bakery
4 produce
5 alcohol
6 international
7 beverages
8 pets
9 dry goods pasta
10 bulk
11 personal care
12 meat seafood
13 pantry
14 breakfast
15 canned goods
16 dairy eggs
17 household
18 babies
19 snacks
20 deli
21 missing
df_dep_t = df_departments.T

Now let’s add the column names back to the transposed dataframe:

df_dep_t.reset_index(inplace=True)
new_header = df_dep_t.iloc[0]
df_dep_t_new = df_dep_t[1:]

Setting the new header:

df_dep_t_new.columns = new_header   

df_dep_t_new.head()
department_id department
1 1 frozen
2 2 other
3 3 bakery
4 4 produce
5 5 alcohol
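The same result can also be reached in one chained expression; a sketch on a toy frame shaped like departments.csv (one row, very wide):

```python
import pandas as pd

# Toy frame shaped like departments.csv
df_departments = pd.DataFrame(
    [["department", "frozen", "other", "bakery"]],
    columns=["department_id", "1", "2", "3"],
)

df_dep_t_new = (
    df_departments
    .set_index("department_id")   # the lone "department" cell becomes the row label
    .T                            # swap rows and columns
    .reset_index()                # bring the old column names back as a column
    .rename(columns={"index": "department_id"})
)
print(df_dep_t_new["department"].tolist())  # ['frozen', 'other', 'bakery']
```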

Exporting the data

After cleaning and transforming data, we often export it.

Exporting to CSV

df_orders.to_csv("../data/processed/orders_cleaned.csv", index=False)
df_products.to_csv("../data/processed/products_cleaned.csv", index=False)
df_dep_t_new.to_csv("../data/processed/departments_t.csv", index=False)
Important

When exporting data, set index=False to avoid including the row indices in the output file, unless the existing index is meaningful for the analysis.

Try without index=False and see what happens. You will get an extra column with the row indices, which is not needed in most cases.

Understanding Relative Paths in Python

What is a File Path?

A file path tells Python where to find a file or folder.

  • Absolute Path: A complete path from the root directory
    Example:
    C:/Users/YourName/Project/data/file.xlsx

  • Relative Path: A path relative to the current working directory
    Example:
    data/file.xlsx


Why Use Relative Paths?

  • Makes your code portable
  • Works across different computers or environments
  • Avoids hardcoding full file system paths

Option A: os.path

import os

# Get current working directory
print("Current working directory:", os.getcwd())

# Join a relative path
file_path = os.path.join("data", "file.xlsx")
print("Relative path:", file_path)

Folder Structure

project/
├── data/
│   ├── file.xlsx
│   └── sample.csv
└── scripts/
    ├── main.py
    └── main.ipynb

main.ipynb is inside the scripts/ folder, so it needs to go up one level (..) to reach the data/ folder.


Option B: Read the CSV file using a relative path

import pandas as pd
from pathlib import Path

# Define relative path to the data folder from scripts/
file_path = Path("../data") / "sample.csv"

# Read the CSV file
df = pd.read_csv(file_path)

# Display the first few rows
print(df.head())

Pandas Cheat Sheet

Important

Here you may find practically all the pandas functionality you are going to need for Data Analytics.