import pandas as pd
s = pd.Series([10, 20, 30, 40])
```
0    10
1    20
2    30
3    40
dtype: int64
```

Open the 04_data_import.ipynb file in the notebooks folder and import pandas and numpy. The raw data is stored in the data/raw folder as Archive.zip, which contains orders.csv and products.csv. There are 3 options to unzip the file:
Guess which one we will use in this course? Yes, you are right! We will use Python code to unzip the file. :)
As you can see, we are using the zipfile library to unzip the file. We open the zip file in read ('r') mode and then extract all the files to the specified location.
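The extraction pattern looks like the sketch below. The paths here are placeholders; in the course the zip would live at data/raw/Archive.zip. To keep the example runnable anywhere, we first build a tiny demo zip in a temporary folder:

```python
import os
import tempfile
import zipfile

# Build a small demo zip in a temp folder (stand-in for data/raw/Archive.zip)
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "Archive.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("orders.csv", "order_id,user_id\n1,1\n")

# The actual pattern: open in read ("r") mode and extract everything
with zipfile.ZipFile(zip_path, "r") as zf:
    zf.extractall(workdir)

print(os.path.exists(os.path.join(workdir, "orders.csv")))  # True
```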
If you are using macOS, after the extraction you might see another folder named __MACOSX. It contains only metadata and you should remove it.
CHALLENGE: Try to delete the __MACOSX folder if it exists.
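One way to solve the challenge is with shutil.rmtree. The path below is hypothetical; in the course it would be the extracted folder inside data/raw. Here we simulate the leftover folder in a temp directory so the sketch runs on its own:

```python
import os
import shutil
import tempfile

# Simulate the leftover metadata folder (stand-in for data/raw/__MACOSX)
folder = os.path.join(tempfile.mkdtemp(), "__MACOSX")
os.makedirs(folder)

# Remove the folder only if it exists
if os.path.exists(folder):
    shutil.rmtree(folder)

print(os.path.exists(folder))  # False
```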
Pay attention to the size of the orders.csv file: it is around 109 MB, so it cannot be pushed to GitHub (which limits files to 100 MB). We will use only a sample of the data for our analysis.
We can either:
Pandas has two primary data structures:
A Series consists of:
| | customer | revenue |
|---|---|---|
| 0 | Anna | 100 |
| 1 | David | 250 |
| 2 | Liza | 180 |
A DataFrame consists of:
In mathematical form, a DataFrame can be represented as a matrix:
\[ X = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix} \]
where each row corresponds to an observation, each column to a variable, and \(x_{ij}\) is the value of variable \(j\) for observation \(i\).
Now that we have unzipped the files, we can import them into our Jupyter Notebook using the pandas library.
We will learn how to import other data formats in later sessions. For now, we will focus on importing CSV files, the most common format for storing tabular data.
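Importing a CSV is a single pd.read_csv call. In the notebook you would read from the file path (e.g. pd.read_csv("../data/raw/orders.csv")); the sketch below reads the same structure from an in-memory string so it is self-contained:

```python
import io
import pandas as pd

# Stand-in for "../data/raw/orders.csv"; the values are illustrative
csv_text = "order_id,user_id,eval_set\n2539329,1,prior\n2398795,1,prior\n"
df_orders = pd.read_csv(io.StringIO(csv_text))
print(df_orders.head())
```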
orders.csv file:
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
Try to make a habit of using the head() method after importing data, to check that it was imported correctly and to get a quick overview of the data.
products.csv file:
| | product_id | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 |
| 1 | 2 | All-Seasons Salt | 104 | 13 | 9.3 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 | 4.5 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 | 10.5 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 | 4.3 |
The head() method returns the top N rows of the data. The default value of N is 5, so if you do not specify the number of rows to display, it will show you the top 5 rows of the DataFrame.
| | product_id | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|
| 0 | 1 | Chocolate Sandwich Cookies | 61 | 19 | 5.8 |
| 1 | 2 | All-Seasons Salt | 104 | 13 | 9.3 |
| 2 | 3 | Robust Golden Unsweetened Oolong Tea | 94 | 7 | 4.5 |
| 3 | 4 | Smart Ones Classic Favorites Mini Rigatoni Wit... | 38 | 1 | 10.5 |
| 4 | 5 | Green Chile Anytime Sauce | 5 | 13 | 4.3 |
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
You can specify the number of rows to display by passing an integer to the head() method. For example, df.head(10) will display the top 10 rows of the DataFrame.
Similarly, the tail() method returns the bottom N rows of the data.
| | product_id | product_name | aisle_id | department_id | prices |
|---|---|---|---|---|---|
| 49688 | 49684 | Vodka, Triple Distilled, Twist of Vanilla | 124 | 5 | 5.3 |
| 49689 | 49685 | En Croute Roast Hazelnut Cranberry | 42 | 1 | 3.1 |
| 49690 | 49686 | Artisan Baguette | 112 | 3 | 7.8 |
| 49691 | 49687 | Smartblend Healthy Metabolism Dry Cat Food | 41 | 8 | 4.7 |
| 49692 | 49688 | Fresh Foaming Cleanser | 73 | 11 | 13.5 |
| | order_id | user_id | eval_set | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 3421078 | 2266710 | 206209 | prior | 10 | 5 | 18 | 29.0 |
| 3421079 | 1854736 | 206209 | prior | 11 | 4 | 10 | 30.0 |
| 3421080 | 626363 | 206209 | prior | 12 | 1 | 12 | 18.0 |
| 3421081 | 2977660 | 206209 | prior | 13 | 1 | 12 | 7.0 |
| 3421082 | 272231 | 206209 | train | 14 | 6 | 14 | 30.0 |
From this simple overview of the data, we can see the structure and column names of each DataFrame.
Although we can get the column names using list(df.columns), it is more common to use df.columns to get the column names as an Index object.
```
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='str')
```
Another handy function is df.info(). It prints basic information about your dataframe: how many rows and columns it has, what the columns are called, and what data types the columns contain.
```
<class 'pandas.DataFrame'>
RangeIndex: 3421083 entries, 0 to 3421082
Data columns (total 7 columns):
 #   Column                  Dtype
---  ------                  -----
 0   order_id                int64
 1   user_id                 int64
 2   eval_set                str
 3   order_number            int64
 4   order_dow               int64
 5   order_hour_of_day       int64
 6   days_since_prior_order  float64
dtypes: float64(1), int64(5), str(1)
memory usage: 198.9 MB
```
There’s also a dedicated way to check just the data types of a dataframe’s columns. If you only want the data types (and nothing else), the df.dtypes attribute gives a cleaner output:
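A minimal sketch using a small stand-in DataFrame (the values are illustrative):

```python
import pandas as pd

# Small stand-in for the orders DataFrame
df = pd.DataFrame({"order_id": [1, 2], "eval_set": ["prior", "train"]})

# .dtypes is an attribute, not a method: no parentheses needed
print(df.dtypes)
```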
With just one function, you can automatically generate all those descriptive statistics you should be well familiar with by this point:
| | order_id | user_id | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|
| count | 3.421083e+06 | 3.421083e+06 | 3.421083e+06 | 3.421083e+06 | 3.421083e+06 | 3.214874e+06 |
| mean | 1.710542e+06 | 1.029782e+05 | 1.715486e+01 | 2.776219e+00 | 1.345202e+01 | 1.111484e+01 |
| std | 9.875817e+05 | 5.953372e+04 | 1.773316e+01 | 2.046829e+00 | 4.226088e+00 | 9.206737e+00 |
| min | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 8.552715e+05 | 5.139400e+04 | 5.000000e+00 | 1.000000e+00 | 1.000000e+01 | 4.000000e+00 |
| 50% | 1.710542e+06 | 1.026890e+05 | 1.100000e+01 | 3.000000e+00 | 1.300000e+01 | 7.000000e+00 |
| 75% | 2.565812e+06 | 1.543850e+05 | 2.300000e+01 | 5.000000e+00 | 1.600000e+01 | 1.500000e+01 |
| max | 3.421083e+06 | 2.062090e+05 | 1.000000e+02 | 6.000000e+00 | 2.300000e+01 | 3.000000e+01 |
Imagine producing that summary by hand in Excel, or writing SQL queries to get the same information! The term “data wrangling” is just a fancy term for conducting manipulations or transformations on data. When you’re working with large sets of data, it can be difficult to see all the data entries and values in your dataframe. This is where Python (and pandas!) comes in.
The function itself is df.drop(), and within its parentheses comes the argument, which, in this case, is a list. The list should contain the individual columns that you want to drop, enclosed in quotation marks. To remove the eval_set column from your orders.csv dataframe, the full function would be:
Create a new DataFrame without the eval_set column
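A minimal sketch of the drop() call. A small stand-in for df_orders is built inline (the values are illustrative):

```python
import pandas as pd

# Small stand-in for df_orders
df_orders = pd.DataFrame({
    "order_id": [2539329, 2398795],
    "user_id": [1, 1],
    "eval_set": ["prior", "prior"],
    "order_number": [1, 2],
})

# drop() returns a new DataFrame; df_orders itself is unchanged
df_orders_cleaned = df_orders.drop(columns=["eval_set"])
print(df_orders_cleaned.columns.tolist())  # ['order_id', 'user_id', 'order_number']
```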
| | order_id | user_id | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | 5 | 4 | 15 | 28.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 3421078 | 2266710 | 206209 | 10 | 5 | 18 | 29.0 |
| 3421079 | 1854736 | 206209 | 11 | 4 | 10 | 30.0 |
| 3421080 | 626363 | 206209 | 12 | 1 | 12 | 18.0 |
| 3421081 | 2977660 | 206209 | 13 | 1 | 12 | 7.0 |
| 3421082 | 272231 | 206209 | 14 | 6 | 14 | 30.0 |
3421083 rows × 6 columns
```
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='str')
```
Note that this does not change the original DataFrame df_orders. If you want to change the original DataFrame, you can either assign the result back to it or use the inplace=True parameter.
The function itself is df.rename(), and within its parentheses comes the argument columns, which expects a dictionary. The dictionary should map old column names to new column names. To rename the eval_set column to dataset_type in your orders.csv dataframe, the full function would be:
Create a new DataFrame with the renamed column
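A minimal sketch of the rename() call, again with a small stand-in DataFrame:

```python
import pandas as pd

# Small stand-in for df_orders
df_orders = pd.DataFrame({"order_id": [2539329], "eval_set": ["prior"]})

# rename() takes a dict mapping old column names to new ones
df_orders_renamed = df_orders.rename(columns={"eval_set": "dataset_type"})
print(df_orders_renamed.columns.tolist())  # ['order_id', 'dataset_type']
```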
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3421078 | 2266710 | 206209 | prior | 10 | 5 | 18 | 29.0 |
| 3421079 | 1854736 | 206209 | prior | 11 | 4 | 10 | 30.0 |
| 3421080 | 626363 | 206209 | prior | 12 | 1 | 12 | 18.0 |
| 3421081 | 2977660 | 206209 | prior | 13 | 1 | 12 | 7.0 |
| 3421082 | 272231 | 206209 | train | 14 | 6 | 14 | 30.0 |
3421083 rows × 7 columns
```
Index(['order_id', 'user_id', 'eval_set', 'order_number', 'order_dow',
       'order_hour_of_day', 'days_since_prior_order'],
      dtype='str')
```
Again, this does not change the original DataFrame df_orders. If you want to change the original DataFrame, you can either assign the result back to it or use the inplace=True parameter.
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
The function itself is df.astype(), and within its parentheses comes a dictionary that specifies the column name and the target data type. To change the order_id column to integer type in your orders.csv dataframe, the full function would be:
Create a new DataFrame with updated data types
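A minimal sketch of the astype() call. Here the stand-in column starts as strings so the conversion to integer is visible:

```python
import pandas as pd

# Stand-in data: order_id read as strings (illustrative)
df_orders = pd.DataFrame({"order_id": ["2539329", "2398795"]})

# astype() takes a dict of {column: target dtype}
df_orders_typed = df_orders.astype({"order_id": "int64"})
print(df_orders_typed.dtypes)
```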
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order |
|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3421078 | 2266710 | 206209 | prior | 10 | 5 | 18 | 29.0 |
| 3421079 | 1854736 | 206209 | prior | 11 | 4 | 10 | 30.0 |
| 3421080 | 626363 | 206209 | prior | 12 | 1 | 12 | 18.0 |
| 3421081 | 2977660 | 206209 | prior | 13 | 1 | 12 | 7.0 |
| 3421082 | 272231 | 206209 | train | 14 | 6 | 14 | 30.0 |
3421083 rows × 7 columns
```
order_id                    int64
user_id                     int64
dataset_type                  str
order_number                int64
order_dow                   int64
order_hour_of_day           int64
days_since_prior_order    float64
dtype: object
```
Again, this does not change the original DataFrame df_orders. To change the original, either assign the result back to it or overwrite the column directly.
You can create new columns or transform existing ones using vectorized operations or the apply() function. Using the columns of our orders DataFrame (order_id, user_id, dataset_type, order_number, order_dow, order_hour_of_day, days_since_prior_order), we will now create meaningful analytical features.
Create a new DataFrame with transformed data
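A sketch of the two boolean features using vectorized operations. The exact definitions are assumptions (weekend = day 0 or 6, morning = before 12:00); only the morning cutoff is visible in the output below:

```python
import pandas as pd

# Small stand-in for the orders DataFrame
df_orders = pd.DataFrame({
    "order_dow": [2, 3, 6],
    "order_hour_of_day": [8, 12, 7],
})

# Assumed definitions: weekend days are 0 and 6; morning is before noon
df_orders["is_weekend"] = df_orders["order_dow"].isin([0, 6])
df_orders["is_morning"] = df_orders["order_hour_of_day"] < 12
print(df_orders)
```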
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order | is_weekend | is_morning |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN | False | True |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 | False | True |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 | False | False |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 | False | True |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 | False | False |
Example: Creating an order frequency category based on order_number
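One way to sketch this with apply(). Only "New" (order_number 1) and "Low" (2 to 5) appear in the output below; the remaining boundaries and labels are hypothetical:

```python
import pandas as pd

df_orders = pd.DataFrame({"order_number": [1, 4, 15, 60]})

# Hypothetical category boundaries (only "New" and "Low" are confirmed above)
def frequency_category(n):
    if n == 1:
        return "New"
    elif n <= 5:
        return "Low"
    elif n <= 20:
        return "Medium"
    return "High"

df_orders["order_frequency_category"] = df_orders["order_number"].apply(frequency_category)
print(df_orders)
```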
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order | is_weekend | is_morning | order_frequency_category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | NaN | False | True | New |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 | False | True | Low |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 | False | False | Low |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 | False | True | Low |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 | False | False | Low |
Example: Handling missing values in days_since_prior_order
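Filling the missing values with fillna(). A first order has no prior order, so NaN becomes 0, as in the output below:

```python
import numpy as np
import pandas as pd

# Stand-in column with a missing first value
df_orders = pd.DataFrame({"days_since_prior_order": [np.nan, 15.0, 21.0]})

# Replace NaN (first order per user) with 0
df_orders["days_since_prior_order"] = df_orders["days_since_prior_order"].fillna(0)
print(df_orders)
```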
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order | is_weekend | is_morning | order_frequency_category |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | 0.0 | False | True | New |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 | False | True | Low |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 | False | False | Low |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 | False | True | Low |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 | False | False | Low |
Example: Creating time-of-day buckets
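One way to build the buckets is pd.cut. The bucket boundaries here are assumptions; only Morning (before 12) and Afternoon (12 and later) are visible in the output below:

```python
import pandas as pd

df_orders = pd.DataFrame({"order_hour_of_day": [3, 8, 12, 20]})

# Assumed boundaries: Night 0-5, Morning 6-11, Afternoon 12-17, Evening 18-23
df_orders["time_bucket"] = pd.cut(
    df_orders["order_hour_of_day"],
    bins=[-1, 5, 11, 17, 23],
    labels=["Night", "Morning", "Afternoon", "Evening"],
)
print(df_orders)
```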
| | order_id | user_id | dataset_type | order_number | order_dow | order_hour_of_day | days_since_prior_order | is_weekend | is_morning | order_frequency_category | time_bucket |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2539329 | 1 | prior | 1 | 2 | 8 | 0.0 | False | True | New | Morning |
| 1 | 2398795 | 1 | prior | 2 | 3 | 7 | 15.0 | False | True | Low | Morning |
| 2 | 473747 | 1 | prior | 3 | 3 | 12 | 21.0 | False | False | Low | Afternoon |
| 3 | 2254736 | 1 | prior | 4 | 4 | 7 | 29.0 | False | True | Low | Morning |
| 4 | 431534 | 1 | prior | 5 | 4 | 15 | 28.0 | False | False | Low | Afternoon |
Vectorized operations (like comparisons and arithmetic directly on columns) are preferred over apply() whenever possible because they are faster and more memory-efficient.
Transposing refers to turning your dataframe’s rows into columns, and vice versa. (This is related to, but not the same as, reshaping data from wide format to long format.) Let’s take a look at how to do this in Python, as it often involves more steps than you’d anticipate!
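The transpose itself is just the .T attribute. A tiny stand-in for the wide departments DataFrame (the real one has 22 columns):

```python
import pandas as pd

# Stand-in for the 1-row, very wide departments DataFrame
df_departments = pd.DataFrame({
    "department_id": ["department"],
    "1": ["frozen"],
    "2": ["other"],
})

# .T swaps rows and columns: 1 row x 3 columns becomes 3 rows x 1 column
df_t = df_departments.T
print(df_t.shape)  # (3, 1)
```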
| | department_id | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | department | frozen | other | bakery | produce | alcohol | international | beverages | pets | dry goods pasta | ... | meat seafood | pantry | breakfast | canned goods | dairy eggs | household | babies | snacks | deli | missing |
1 rows × 22 columns
This is a strange-looking dataframe, isn’t it? There’s only 1 row and 22 columns, making the dataframe incredibly wide and incredibly short:
| | 0 |
|---|---|
| department_id | department |
| 1 | frozen |
| 2 | other |
| 3 | bakery |
| 4 | produce |
| 5 | alcohol |
| 6 | international |
| 7 | beverages |
| 8 | pets |
| 9 | dry goods pasta |
| 10 | bulk |
| 11 | personal care |
| 12 | meat seafood |
| 13 | pantry |
| 14 | breakfast |
| 15 | canned goods |
| 16 | dairy eggs |
| 17 | household |
| 18 | babies |
| 19 | snacks |
| 20 | deli |
| 21 | missing |
Now let’s add the column names back to the transposed dataframe:
Setting the new header:
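A common pattern for this (a sketch with a small stand-in for the transposed DataFrame): take the first row as the header, drop that row, and assign it as the column names:

```python
import pandas as pd

# Stand-in for the transposed departments DataFrame: the first row holds
# the value we want as the column header
df_t = pd.DataFrame({0: ["department", "frozen", "other", "bakery"]})

new_header = df_t.iloc[0]   # grab the first row for the header
df_t = df_t[1:]             # keep the data minus the header row
df_t.columns = new_header   # set the header row as the column names
print(df_t.columns.tolist())  # ['department']
```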
After cleaning and transforming data, we often export it.
When exporting data, set index=False to avoid writing the row indices into the output file, unless the existing index is meaningful for your analysis.
Try without index=False and see what happens. You will get an extra column with the row indices, which is not needed in most cases.
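A minimal sketch of the export. It writes to an in-memory buffer so it runs anywhere; in the course you would pass a file path (for example somewhere under the data/ folder) instead:

```python
import io
import pandas as pd

df = pd.DataFrame({"product_id": [1, 2], "prices": [5.8, 9.3]})

# index=False keeps the row indices out of the CSV
buf = io.StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue())
```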
A file path tells Python where to find a file or folder.
Absolute Path: A complete path from the root directory
Example:
C:/Users/YourName/Project/data/file.xlsx
Relative Path: A path relative to the current working directory
Example:
data/file.xlsx
os.path
main.ipynb is inside the notebooks/ folder, so it needs to go up one level (..) to access the data/ folder.
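Building such a relative path with os.path.join keeps it portable across operating systems:

```python
import os

# From a notebook inside notebooks/, go up one level (..) and into data/
path = os.path.join("..", "data", "file.xlsx")
print(path)
```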
Here you can find practically all the pandas functionality you will need for Data Analytics.